The biggest problem with ASCII is indicated by the first letter of its name. ASCII is a truly American standard, so it does not even meet the needs of other English-speaking countries. For example, where is the British pound symbol (£)?
English uses the Latin (or Roman) alphabet. Among written languages that use the Latin alphabet, English is unusual in that it rarely requires accent marks (or diacritics). Even for the few English words where diacritics are traditionally proper, such as coöperate or résumé, spellings without the diacritics are perfectly acceptable.
But in many countries to the north and south of the United States and across the Atlantic, diacritics are common in languages that use the Latin alphabet. These accent marks were originally devised to adapt the Latin alphabet to the different sounds of those languages. Venture farther east or south of Western Europe and you will encounter languages that do not use the Latin alphabet at all, such as Greek, Hebrew, Arabic, and Russian (which uses the Cyrillic alphabet). Travel even farther east and you will find the ideographs of Chinese, which were also adopted in Japan and Korea.
The history of ASCII goes back to 1967, and since then efforts have focused on overcoming its limitations so as to better suit languages other than American English. For example, in 1967 the International Organization for Standardization (ISO) recommended an ASCII variant in which the codes 0x40, 0x5B, 0x5C, 0x5D, 0x7B, 0x7C, and 0x7D were "reserved for national use," while the codes 0x5E, 0x60, and 0x7E "may be used for other graphical symbols when national requirements call for 8, 9, or 10 character positions." This is obviously not an ideal international solution, because it does nothing to guarantee consistency. But it shows how people tried to make room for different languages in the encoding.
Extended ASCII
In the early days of small-computer development, the 8-bit byte became firmly established. So if one byte is used to hold a character, 128 additional characters can be chosen to supplement ASCII. In 1981, when the original IBM PC was introduced, the ROM of its video adapter contained a character set of 256 characters, and that set itself became an important part of the IBM standard.
The original IBM extended character set included some accented characters and a lowercase Greek alphabet (useful for mathematical notation), as well as block-drawing and line-drawing characters. Additional characters were also assigned to the code positions of the ASCII control characters, because most of the control characters are not used for display.
The IBM extended character set was burned into the ROMs of countless video adapters and printers, and it was used by many applications to adorn their text-mode displays. However, this character set does not provide enough accented characters for all the Western European languages that use the Latin alphabet, and it is not well suited to Windows. Windows does not need the graphics characters, because it has a fully graphical interface.
So far we have seen character sets of 256 characters. But there are about 21,000 ideographs in Chinese, Japanese, and Korean. How can these languages be accommodated while still maintaining some compatibility with ASCII?
The solution (or at least one solution) is the double-byte character set (DBCS). A DBCS starts off with 256 codes, just like ASCII. Like any well-behaved code page, the first 128 codes are ASCII. However, some of the codes in the higher 128 are always followed by a second byte. The two bytes together (called a lead byte and a trail byte) define a single character, usually a complex ideograph.
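As an illustrative sketch, here is how a short DBCS string mixes 1-byte and 2-byte characters. The byte values below assume code page 932 (Japanese Shift-JIS), where the two-byte sequence 0x82 0xA0 encodes the hiragana character "a":

/* Code page 932: 'A' is the single byte 0x41 (plain ASCII), while the
   hiragana character that follows is the two-byte sequence 0x82 0xA0,
   where 0x82 is the lead byte and 0xA0 its trail byte. */
unsigned char text[] = { 0x41, 0x82, 0xA0, 0x00 };
/* Three bytes of data plus the terminating NUL, but only two characters. */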
Although Chinese, Japanese, and Korean share some of the same ideographs, the three languages are obviously different, and often the same ideograph means different things in the three languages. Windows supports four different double-byte character sets: code page 932 (Japanese), 936 (Simplified Chinese), 949 (Korean), and 950 (Traditional Chinese). DBCS is supported only in the versions of Windows produced for these countries.
The problem with a double-byte character set is not that characters are represented by 2 bytes. The problem is that some characters (in particular, the ASCII characters) are represented by 1 byte. This creates extra programming headaches. For example, the number of characters in a string cannot be determined from the number of bytes in the string. The string must be parsed to determine its length, examining each byte to see whether it is the lead byte of a double-byte character. And if you have a pointer into the middle of a DBCS string, what is the address of the preceding character? The customary solution is to parse the string from the beginning!
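The Win32 function IsDBCSLeadByte reports whether a byte is the lead byte of a two-byte character in the system code page, so the front-to-back parse just described might look like the sketch below (the function name DbcsCharCount is a made-up example, not part of the API):

#include <windows.h>

/* Counts characters (not bytes) in a DBCS string by parsing it from
   the beginning, skipping 2 bytes whenever a lead byte is found. */
int DbcsCharCount(const char *p)
{
    int count = 0;

    while (*p)
    {
        p += IsDBCSLeadByte((BYTE) *p) ? 2 : 1;
        ++count;
    }
    return count;
}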
The Unicode Solution
The basic problem we face is that the world's written languages cannot simply be represented by 256 8-bit codes. The previous solutions, including code pages and DBCS, have proved unsatisfactory and clumsy. So what is the real solution?
As programmers, we have seen this kind of problem before. When there are too many things to represent with 8-bit values, we try wider values, say 16-bit values. That, in essence, is why Unicode was developed. Unlike the muddle of 256-character code pages and double-byte character sets that mix 1-byte and 2-byte codes, Unicode is a uniform 16-bit system that allows 65,536 characters to be represented. This is ample for all the characters of the world's written languages, including those that use ideographs, along with collections of mathematical, symbolic, and currency-unit symbols.
It is important to understand the difference between Unicode and a DBCS. Unicode is what is called (particularly in the context of the C programming language) a "wide character set": every character in Unicode is 16 bits wide rather than 8 bits wide, and an 8-bit value by itself has no meaning in Unicode. By contrast, in a double-byte character set we still deal with 8-bit values: some bytes define characters by themselves, while others indicate that a character is defined together with the byte that follows.
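In C this difference shows up directly in the string functions. A minimal sketch, assuming the 16-bit wchar_t used by Windows compilers:

#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    char    a[] =  "Hello!";   /* 8-bit characters:  7 bytes with the NUL  */
    wchar_t w[] = L"Hello!";   /* 16-bit characters: 14 bytes with the NUL */

    /* Both calls report 6 characters; no parsing is needed for the wide
       string, because every character is exactly one 16-bit element. */
    printf("%zu %zu\n", strlen(a), wcslen(w));
    return 0;
}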
Working with DBCS strings is very messy, but working with Unicode text is like working with orderly text. You will be happy to learn that the first 128 Unicode characters (16-bit codes 0x0000 through 0x007F) are the ASCII characters, and the next 128 Unicode characters (codes 0x0080 through 0x00FF) are the ISO 8859-1 extensions to ASCII. Characters in other parts of Unicode are likewise based on existing standards, which eases conversion. The Greek alphabet uses codes 0x0370 through 0x03FF, Cyrillic uses codes 0x0400 through 0x04FF, Armenian uses codes 0x0530 through 0x058F, and Hebrew uses codes 0x0590 through 0x05FF. The ideographs of Chinese, Japanese, and Korean (referred to collectively as CJK) occupy codes 0x3000 through 0x9FFF.
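A few wide-character values make this layout concrete; the code points below are taken from the published Unicode charts:

#include <wchar.h>

wchar_t ascii    = 0x0041;   /* 'A'       -- same code as ASCII           */
wchar_t latin1   = 0x00E9;   /* e-acute   -- ISO 8859-1 block             */
wchar_t greek    = 0x03B1;   /* alpha     -- Greek block, 0x0370-0x03FF   */
wchar_t cyrillic = 0x0414;   /* De        -- Cyrillic block               */
wchar_t hebrew   = 0x05D0;   /* alef      -- Hebrew block                 */
wchar_t cjk      = 0x4E2D;   /* "middle"  -- CJK ideograph range          */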
From "Programming Windows"