C program to read and write Unicode files

swanfu · Post time: 2020-1-15 18:20:01

I ran into a problem while developing a C program that reads and writes Unicode files (road.txt below is an excerpt from the UTF-16 file).

road.txt

10 3938 CH Tuen Mun EN TUEN MUN
10 3939 CH Islands EN ISLANDS
10 3942 CH East EN EASTERN
10 3955 CH Victoria Harbour EN VICTORIA HARBOUR
10 3956 CH Peng Chau EN PENG CHAU ISLAND
10 4023-1 CH Kowloon Bay EN KOWLOON BAY
10 14000 CH Wo Fung Street EN WO FUNG

Program development requirements:

The file road.txt is a UTF-16 file. The C program must read the file and extract the ID (such as 3938), the language (such as CH), and the place name (such as Tuen Mun), then store them in a structure: the ID and language are ASCII strings; the place name is UTF-16.
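A minimal sketch of such a structure and a line parser, for illustration only. It assumes each line has already been decoded into UTF-16 code units, that fields are separated by single spaces, and that the place name runs until the " EN " tag; the real file may use a different delimiter. The names RoadEntry, copy_ascii_field and parse_road_line, and the buffer sizes, are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define ROAD_ID_MAX   16
#define ROAD_LANG_MAX 4
#define ROAD_NAME_MAX 64

/* Hypothetical record layout: ASCII id and language, UTF-16 place name. */
typedef struct {
    char     id[ROAD_ID_MAX];      /* e.g. "3938" or "4023-1" */
    char     lang[ROAD_LANG_MAX];  /* e.g. "CH" */
    uint16_t name[ROAD_NAME_MAX];  /* UTF-16 code units, 0-terminated */
} RoadEntry;

/* Copy one space-delimited field of ASCII-range code units into dst.
   Returns the index just past the field, or 0 on failure. */
static size_t copy_ascii_field(const uint16_t *line, size_t len, size_t pos,
                               char *dst, size_t dst_size)
{
    size_t n = 0;
    while (pos < len && line[pos] == 0x0020)      /* skip leading spaces */
        pos++;
    while (pos < len && line[pos] != 0x0020) {
        if (line[pos] > 0x7F || n + 1 >= dst_size)
            return 0;                             /* not ASCII, or too long */
        dst[n++] = (char)line[pos++];
    }
    dst[n] = '\0';
    return n ? pos : 0;
}

/* Parse "<type> <id> <lang> <name> EN ..." (assumed layout).
   Returns 0 on success, -1 on failure. */
int parse_road_line(const uint16_t *line, size_t len, RoadEntry *out)
{
    char record_type[8];                          /* leading "10", not stored */
    size_t pos = copy_ascii_field(line, len, 0, record_type, sizeof record_type);
    if (!pos) return -1;
    pos = copy_ascii_field(line, len, pos, out->id, sizeof out->id);
    if (!pos) return -1;
    pos = copy_ascii_field(line, len, pos, out->lang, sizeof out->lang);
    if (!pos) return -1;
    while (pos < len && line[pos] == 0x0020)
        pos++;

    size_t n = 0;
    while (pos < len && n + 1 < ROAD_NAME_MAX) {
        /* Stop at the " EN " tag that introduces the English name. */
        if (pos + 3 < len && line[pos] == 0x0020 && line[pos + 1] == 'E' &&
            line[pos + 2] == 'N' && line[pos + 3] == 0x0020)
            break;
        out->name[n++] = line[pos++];
    }
    out->name[n] = 0;
    return n ? 0 : -1;
}
```

The place name is kept as raw code units, so no locale setting is involved; only the ID and language fields are narrowed to char.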

The problems I encountered:

1. After successfully opening the file (using fopen), I cannot use fgetwc to read the UTF-16 characters from the stream two bytes at a time. Each call to fgetwc reads only one byte.
2. Because I can't read two bytes at a time, I call fgetc twice to read one character. For some characters, however, the read loop ends early. For example, "Victoria Harbour" is encoded in hex as AD 7D 1A 59 29 52 9E 4E 2F 6E, and the loop stops when it reaches 1A; it seems that 1A is being read as EOF.
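One way to sidestep both problems is to open the file in binary mode ("rb") and assemble each code unit from two bytes read with fread. The sketch below assumes the file is little-endian UTF-16, which is what the hex dump above suggests (AD 7D read as 0x7DAD); the function names are hypothetical and overlong lines are simply truncated.

```c
#include <stdio.h>
#include <stdint.h>

/* Read one UTF-16LE code unit from a stream opened in binary mode.
   Returns 1 on success, 0 at end of file or on a truncated unit. */
int read_utf16le_unit(FILE *fp, uint16_t *unit)
{
    unsigned char b[2];
    if (fread(b, 1, 2, fp) != 2)
        return 0;
    *unit = (uint16_t)(b[0] | (b[1] << 8));   /* low byte first */
    return 1;
}

/* Read one line of code units into buf (capacity cap, including the 0
   terminator). The line feed is consumed but not stored, and carriage
   returns are dropped. Returns the number of units stored, or -1 if the
   stream was already at end of file. */
long read_utf16le_line(FILE *fp, uint16_t *buf, size_t cap)
{
    size_t n = 0;
    uint16_t u;
    int got_any = 0;

    if (cap == 0)
        return -1;
    while (read_utf16le_unit(fp, &u)) {
        got_any = 1;
        if (u == 0x000A)          /* line feed ends the line */
            break;
        if (u == 0x000D)          /* skip carriage return */
            continue;
        if (n + 1 < cap)          /* silently truncate overlong lines */
            buf[n++] = u;
    }
    if (!got_any)
        return -1;
    buf[n] = 0;
    return (long)n;
}
```

Because the stream is binary, the byte 0x1A is returned like any other byte instead of ending the read.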

Thank you for any pointers =-)

zjwzjw · Post time: 2020-1-26 14:00:01

1A ...

zjwzjw · Post time: 2020-1-26 16:09:01

File reading (avoid reading half a character):

/* ----------------------- Own_lib library series: mbsfgets.cc -----------------------
cy_mbsfgets() / mbsfgets.o / libcyfunc.a

Description: When processing multi-byte text files (such as Chinese), the standard fgets() function counts in
      bytes, so it can easily read half a character and leave the string incomplete. Handling this in the
      calling program requires extra code, which makes the program less concise.
      This function encapsulates that handling, so that reading East Asian text is more straightforward and
      the calling code stays concise. It uses the standard library functions mbstowcs() / wcstombs(), so you
      should call setlocale() at the beginning of your program.
      As with the standard fgets(), a newline at the end of a line is read like any other character, and a
      '\0' (null character) terminator is appended to the end of the string.
      The caller passes: the address of a destination string with at least (MB_CUR_MAX * max_chars + 1) bytes
      of memory | the maximum number of characters (not bytes) | the file stream pointer | and an "LC_CTYPE"
      locale string (such as "zh_CN.GBK"; the default is NULL, which leaves the system setting unchanged.
      If an invalid locale is passed, the function does not change the current setting, but it still behaves
      as if a locale had been passed!).
      The second function is an overload that adds two reference parameters: one returns the actual number of
      bytes in the string read, the other the actual number of characters.
dest_str:
      Pointer to the memory that receives the string, or NULL.
max_chars:
      The maximum number of characters to read from the file stream.
stream:
      A file stream pointer (FILE *).
[length]:
      Returns the number of bytes in the string actually read.
[chars]:
      Returns the number of characters in the string actually read.
locale_ctype = NULL:
      The LC_CTYPE locale (language / encoding) of the string being read. Most programs set this at startup,
      so it can usually be omitted.
return value:
      Returns a pointer to the string in the dest_str memory ('\0' has been appended). Returns NULL on error.

Note! 1. The function reads from stream until a newline appears, the end of the file is reached, or
         max_chars characters have been read.
      2. When an LC_CTYPE locale is passed, dest_str should be NULL; the function then returns a pointer to a
         string allocated on the free store (delete [] it after use and set the pointer to NULL). If dest_str
         is not NULL, the function only advances the file position by one line or by the specified number of
         characters (when that is less than one line) and returns NULL.
         When no LC_CTYPE locale is passed and dest_str is NULL, the behavior is the same as when both
         LC_CTYPE and dest_str are given (an empty string "" may be passed): only the file position moves,
         and NULL is returned.
      3. After a locale string is passed, the function temporarily changes "LC_CTYPE" and restores the
         original value on exit.

Author: Ren Xiao | 2002.05.08.
Copyright: GNU General (Library) Public License (GPL / LGPL)

* Editor: vim-6.0 | Operating System: TurboLinux7.0 Simplified Chinese Edition *
------------------------------------------------------------------------------ */

#include <stdio.h>  // use for fgets()
#include <stdlib.h> // use for MB_CUR_MAX -- the maximum number of bytes per character in the current multibyte environment;
                    // also use for mbs* and others.
#include <locale.h> // use for setlocale() -- lets the function change the "LC_CTYPE" environment

#include "cyinclude/cyfget.h"

char *cy_mbsfgets(char *dest_str, int max_chars, FILE *stream,
                  const char *locale_ctype = NULL)
{
    long fpos_before;
    if ((fpos_before = ftell(stream)) == -1) // Get the file position before reading
        return NULL;

    int num = max_chars * MB_CUR_MAX;
    bool locale_check = false;
    // LC_CTYPE original value. Note: a later setlocale() call may overwrite the
    // string this pointer refers to; a more robust version would copy it first.
    char *locale_original = setlocale(LC_CTYPE, NULL);
    if (locale_ctype)
    {
        if (!dest_str)
        {
            setlocale(LC_CTYPE, locale_ctype);
            num = max_chars * MB_CUR_MAX; // MB_CUR_MAX may differ in the new locale
            locale_check = true;
        }
        else
            dest_str = NULL;
    }

    size_t test_length = 0; // used to test whether the number of input characters exceeds the limit
    char *tmp_chars = new char[num + 1]; // the maximum possible space for the string, for temporary storage
    wchar_t *tmp_wchars = new wchar_t[max_chars + 1];

    if ((!tmp_chars) || (!tmp_wchars))
    { // One of them may have been allocated successfully!
        if (locale_check)
            setlocale(LC_CTYPE, locale_original);
        delete [] tmp_chars;
        tmp_chars = NULL;
        delete [] tmp_wchars;
        tmp_wchars = NULL;
        return NULL;
    }

    if (!fgets(tmp_chars, num + 1, stream)) // read at most the maximum number of bytes
    {
        if (locale_check)
            setlocale(LC_CTYPE, locale_original);
        delete [] tmp_chars;
        tmp_chars = NULL;
        delete [] tmp_wchars;
        tmp_wchars = NULL;
        return NULL;
    }

    test_length = mbstowcs(tmp_wchars, tmp_chars, max_chars + 1);
    if ((test_length == (size_t) -1) || (test_length == max_chars + 1))
        tmp_wchars[max_chars] = L'\0'; // Extra characters or illegal bytes are truncated or overwritten

    size_t chars_length; // Actual length in bytes of the string read (used to set the file position)
    char *return_chars = NULL;
    if (locale_check)
    {
        chars_length = wcstombs(NULL, tmp_wchars, num + 1);
        return_chars = new char[chars_length + 1];
        if (!return_chars)
        {
            setlocale(LC_CTYPE, locale_original);
            delete [] tmp_chars;
            tmp_chars = NULL;
            delete [] tmp_wchars;
            tmp_wchars = NULL;
            return NULL;
        }
        wcstombs(return_chars, tmp_wchars, chars_length + 1);
    }
    else
        chars_length = wcstombs(dest_str, tmp_wchars, num + 1);

    if (locale_check)
        setlocale(LC_CTYPE, locale_original);

    delete [] tmp_chars;
    tmp_chars = NULL;
    delete [] tmp_wchars;
    tmp_wchars = NULL;

    if (fseek(stream, fpos_before + chars_length, SEEK_SET))
        return NULL; // reset the file position; return NULL on error

    if (locale_check)
        return return_chars;
    else
        return dest_str;
}
// ------------------------------------------------ -----------------------------

// Overloaded function of the previous function.

char *cy_mbsfgets(char *dest_str, int max_chars, FILE *stream,
                  int &length, int &chars, const char *locale_ctype = NULL)
{
    long fpos_before;
    if ((fpos_before = ftell(stream)) == -1) // Get the file position before reading
        return NULL;

    int num = max_chars * MB_CUR_MAX;
    bool locale_check = false;
    // LC_CTYPE original value. Note: a later setlocale() call may overwrite the
    // string this pointer refers to; a more robust version would copy it first.
    char *locale_original = setlocale(LC_CTYPE, NULL);
    if (locale_ctype)
    {
        if (!dest_str)
        {
            setlocale(LC_CTYPE, locale_ctype);
            num = max_chars * MB_CUR_MAX; // MB_CUR_MAX may differ in the new locale
            locale_check = true;
        }
        else
            dest_str = NULL;
    }

    size_t test_length = 0; // used to test whether the number of input characters exceeds the limit
    char *tmp_chars = new char[num + 1]; // the maximum possible space for the string, for temporary storage
    wchar_t *tmp_wchars = new wchar_t[max_chars + 1];

    if ((!tmp_chars) || (!tmp_wchars))
    { // One of them may have been allocated successfully!
        if (locale_check)
            setlocale(LC_CTYPE, locale_original);
        delete [] tmp_chars;
        tmp_chars = NULL;
        delete [] tmp_wchars;
        tmp_wchars = NULL;
        return NULL;
    }

    if (!fgets(tmp_chars, num + 1, stream)) // read at most the maximum number of bytes
    {
        if (locale_check)
            setlocale(LC_CTYPE, locale_original);
        delete [] tmp_chars;
        tmp_chars = NULL;
        delete [] tmp_wchars;
        tmp_wchars = NULL;
        return NULL;
    }

    test_length = mbstowcs(tmp_wchars, tmp_chars, max_chars + 1);
    if ((test_length == (size_t) -1) || (test_length == max_chars + 1))
    {
        tmp_wchars[max_chars] = L'\0'; // Extra characters or illegal bytes are truncated or overwritten
        chars = max_chars; // !. Returns the actual number of characters through the reference
    }
    else
        chars = test_length; // !. Returns the actual number of characters through the reference

    size_t chars_length; // Actual length in bytes of the string read, used to set the file position and as a return value
    char *return_chars = NULL;
    if (locale_check)
    {
        chars_length = wcstombs(NULL, tmp_wchars, num + 1);
        return_chars = new char[chars_length + 1];
        if (!return_chars)
        {
            setlocale(LC_CTYPE, locale_original);
            delete [] tmp_chars;
            tmp_chars = NULL;
            delete [] tmp_wchars;
            tmp_wchars = NULL;
            return NULL;
        }
        wcstombs(return_chars, tmp_wchars, chars_length + 1);
        length = chars_length; // !. Returns the actual number of bytes through the reference
    }
    else
    {
        chars_length = wcstombs(dest_str, tmp_wchars, num + 1);
        length = chars_length; // !. Returns the actual number of bytes through the reference
    }

    if (locale_check)
        setlocale(LC_CTYPE, locale_original);

    delete [] tmp_chars;
    tmp_chars = NULL;
    delete [] tmp_wchars;
    tmp_wchars = NULL;

    if (fseek(stream, fpos_before + chars_length, SEEK_SET))
        return NULL; // reset the file position; return NULL on error

    if (locale_check)
        return return_chars;
    else
        return dest_str;
}

zjwzjw · Post time: 2020-1-26 16:45:01

/* ------------------------- Own_lib library series: utf8to.cc -------------------------
cy_utf8to() / utf8to.o / libcyfunc.a

Description: Uses the iconv() function to convert a UTF-8 encoded byte sequence into a string in any encoding
      supported by the system.
      The second function is an overload that adds a reference parameter returning the number of source
      bytes actually converted.
out_buf:
      Pointer to the destination buffer for the converted string.
buf_len:
      Size in bytes of the destination buffer (including room for the terminating '\0').
out_code:
      The encoding of the destination string.
in_str:
      Pointer to the source string to convert (must end with the usual '\0' terminator).
[in_len]:
      Returns the number of bytes of the source (UTF-8) string that were converted, excluding the '\0' terminator.
conv_begin = false:
      For encodings with shift states, set this to true on the first call when converting a string in
      consecutive segments.
return value:
      Returns the number of bytes in the converted destination string. Returns -1 on error. Returns 0 if an
      illegal byte sequence is encountered or buf_len is insufficient (non-null characters in in_str remain
      to be converted).

Note! 1. Converting between encodings with iconv() does not depend on the system LC_CTYPE / locale.
      2. When 0 is returned because of an illegal byte sequence or because buf_len is smaller than the
         converted string requires, out_buf still holds the part that was converted successfully (with the
         '\0' terminator already in the correct position).

Author: Ren Xiao | 2002.05.26.
Copyright: GNU General (Library) Public License (GPL / LGPL)

* Editor: vim-6.0 | Operating System: TurboLinux7.0 Simplified Chinese Edition *
------------------------------------------------------------------------------ */

#include <string.h> // use for strlen()
#include <iconv.h>  // use for iconv_open() / iconv() / iconv_close()

#include "cyinclude/cyutf.h"

size_t cy_utf8to(char *out_buf,
                 size_t buf_len,
                 const char *out_code,
                 const char *in_str,
                 bool conv_begin = false)
{
    static iconv_t its_conv;
    if (!in_str) // checked before iconv_open() so that no descriptor is leaked
        return (size_t) -1;
    if ((its_conv = iconv_open(out_code, "UTF-8")) == (iconv_t) -1)
        return (size_t) -1;
    if (conv_begin)
        iconv(its_conv, NULL, NULL, NULL, NULL); // reset the conversion state

    const char *instr = in_str;
    size_t inlen = strlen(in_str) + 1; // include the '\0' terminator in the conversion
    char *outstr = out_buf;
    size_t outlen = buf_len - 1;
    size_t ret_conv = 0;
    // Note: POSIX declares the input pointer as char **; on such systems a cast may be needed here.
    ret_conv = iconv(its_conv, &instr, &inlen, &outstr, &outlen);
    // When outlen is not long enough, iconv() returns -1
    iconv_close(its_conv);

    if ((ret_conv == (size_t) -1) || (inlen == 1))
        outstr[0] = '\0'; // Add the '\0' terminator at the end of the string

    if (inlen == 1) // this test must come before the next one
        return (buf_len - 1 - outlen);
    if (ret_conv == (size_t) -1)
        return 0;

    return (buf_len - 2 - outlen);
}
// ------------------------------------------------ ------------------------------

// Overloaded function of the previous function

size_t cy_utf8to(char *out_buf,
                 size_t buf_len,
                 const char *out_code,
                 const char *in_str,
                 size_t &in_len, //!
                 bool conv_begin = false)
{
    static iconv_t its_conv;
    if (!in_str) // checked before iconv_open() so that no descriptor is leaked
        return (size_t) -1;
    if ((its_conv = iconv_open(out_code, "UTF-8")) == (iconv_t) -1)
        return (size_t) -1;
    if (conv_begin)
        iconv(its_conv, NULL, NULL, NULL, NULL); // reset the conversion state

    const char *instr = in_str;
    size_t inlen, save_len;
    inlen = save_len = strlen(in_str) + 1; // include the '\0' terminator in the conversion
    char *outstr = out_buf;
    size_t outlen = buf_len - 1;
    size_t ret_conv = 0;
    // Note: POSIX declares the input pointer as char **; on such systems a cast may be needed here.
    ret_conv = iconv(its_conv, &instr, &inlen, &outstr, &outlen);

    iconv_close(its_conv);

    // Report the number of source bytes converted, excluding the terminator
    if (inlen == 0)
        in_len = (save_len - 1) - inlen;
    else
        in_len = save_len - inlen;

    if ((ret_conv == (size_t) -1) || (inlen == 1))
        outstr[0] = '\0'; // Add the '\0' terminator at the end of the string

    if (inlen == 1) // this test must come before the next one
        return (buf_len - 1 - outlen);
    if (ret_conv == (size_t) -1)
        return 0;

    return (buf_len - 2 - outlen);
}

swanfu · Post time: 2020-2-9 18:00:02

Thank you very much!

1. I still need to wrap this in a function of my own. Should I read the multibyte data first and then convert it to wide characters? With fgetwc I can't read the multibyte data.

2. The 0x1A problem does not seem to be specific to me.

http://community.csdn.net/Expert/topic/5214/5214784.xml?temp=.6545221
Main question: Unicode text file reading?
Author: guochun (yingc)

axx1611 (long long *&ago) () Reputation: 100 Blog 2006-12-8 12:08:58
  I opened VC6 today and took a look. The encoding of the character "No" contains 0x1A,
  and fgetc() thinks the file has ended, for reasons unknown.
  gcc doesn't have this problem (a VC6 bug??).
  fread should have no problem.

axx1611 (long long *&ago) () Reputation: 100 Blog 2006-12-8 12:38:21
  I was wrong.
  I also tried VC8 and it has the same problem.
  Personally, I think binary mode should still be used here.
  PS. fgetwc also reads ANSI, but it automatically converts the ANSI it reads into Unicode before returning it, so it is no use here.

It's strange, but I can read the file after using setmode to switch the stream to binary mode.

althes · Post time: 2020-3-7 22:15:01

0x1A (Ctrl-Z) is the end-of-file marker for text files on DOS/Windows. You are reading byte by byte in text mode, so when the runtime encounters 0x1A it treats the file as finished.

althes · Post time: 2020-3-11 16:00:01

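    int i = 0;
    wchar_t buf[100];
    wint_t ch;

    FILE *input = fopen("c:\\unicode.txt", "rb"); // binary mode, so 0x1A is not treated as end of file
    if (input)
    {
        // Test the return value of fgetwc() rather than feof(), and leave room for the terminator
        while (i < 99 && (ch = fgetwc(input)) != WEOF)
        {
            buf[i++] = (wchar_t) ch;
        }
        buf[i] = 0;
        fclose(input);
        wcout.imbue(locale("chs"));
        // buf + 1 skips the first character read (presumably the byte order mark)
        wcout << (i > 0 ? buf + 1 : buf) << L"中国\n";
    }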
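The `buf + 1` in the snippet above skips the first character read, presumably the byte order mark (BOM). A small sketch of checking for the BOM explicitly instead of assuming it is there; the enum and function names are hypothetical, and the stream is assumed to be opened in binary mode and positioned at the start of the file.

```c
#include <stdio.h>

typedef enum { UTF16_UNKNOWN, UTF16_LE, UTF16_BE } Utf16Order;

/* Look at the first two bytes of the stream.
   If they form a UTF-16 BOM, they are consumed and the byte order is
   returned. Otherwise the stream is rewound to the start and
   UTF16_UNKNOWN is returned, so no text is lost. */
Utf16Order detect_utf16_bom(FILE *fp)
{
    unsigned char b[2];
    if (fread(b, 1, 2, fp) == 2) {
        if (b[0] == 0xFF && b[1] == 0xFE)
            return UTF16_LE;      /* FF FE: little-endian */
        if (b[0] == 0xFE && b[1] == 0xFF)
            return UTF16_BE;      /* FE FF: big-endian */
    }
    rewind(fp);                   /* no BOM: start over from byte 0 */
    return UTF16_UNKNOWN;
}
```

When UTF16_UNKNOWN is returned, the caller still has to choose a byte order by some other means.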
