View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0018317||CentOS-8||glibc||public||2021-09-30 17:35||2022-01-09 02:03|
|Summary||0018317: iconv silently corrupts data|
|Description||Using iconv for code page conversion produces corrupted output when the "-c" flag (discard characters that cannot be converted) is used on input where characters that *can* and *cannot* be converted appear together.
The issue only manifests for rather large inputs (presumably > 32K).
There is no error or warning; the results are just broken.
|Steps To Reproduce||Open bash and run:|
>perl -E 'say "\x58\xe2\x58\xc3\x92\x58\xe2\x58\x58\xe2\x58\xc3\x92\x58\xe2\x58\n" x 15000' | iconv -c -f ISO-8859-3 -t UTF-8 | sort | uniq -c
It creates 15000 lines mixing "X", the ISO-8859-3-convertible byte \xe2, and the ISO-8859-3-unconvertible sequence \xc3\x92, which are fed into iconv for conversion to UTF-8.
I expect \xe2 to be converted, \xc3\x92 to be dropped (because of "-c") and in any case, all lines to be equal.
Something like this:
However I get *a mix* of broken lines.
I.e. the actual output is:
> 2 XXâX�XâX
> 2 XâX�XXâX
> 2 XâX�XâX
> 1 XâX�XâXX
> 2 XâX�XâXXâX�X�XâXXâX�XâX
> 14917 XâX�XâXXâX�XâX
As can be seen, many lines just disappear (14917+2+1+2+2+2+1 don't sum up to 15000).
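A quick way to make the missing-line check explicit is to print the total and distinct line counts of the iconv output. The following is only a sketch: `summarize_lines` is a hypothetical helper name, not part of iconv or of this report, and it assumes a POSIX shell with awk available.

```shell
# Hypothetical helper (not part of iconv): reads lines on stdin and prints
# "<total> <distinct>". LC_ALL=C makes awk compare raw bytes, so broken
# UTF-8 in the input does not affect the counts.
summarize_lines() {
  LC_ALL=C awk '
    { if (!($0 in seen)) { seen[$0] = 1; distinct++ }
      total++ }
    END { printf "%d %d\n", total, distinct }
  '
}
```

Piping the iconv output into `summarize_lines` instead of `sort | uniq -c` should print `15000 1` if the expectation stated above holds; any other pair of numbers indicates lost or altered lines.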
|Additional Information||The specific input does not matter, as long as it has a mix of convertible and non-convertible characters.|
Reducing the number of input lines to a smaller number (e.g. 1000) makes everything work as expected:
I tried this for ISO-8859-3 and ISO-8859-8 (same input) with similar (wrong) results.
Results are broken in the latest CentOS 8.4 and RHEL 8.4, as well as in CentOS 6.10.
Using piconv (Perl variant of iconv) instead of iconv produces correct results.
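To narrow down the input size at which the output starts to break, one could sweep the line count. This is a sketch under assumptions: `gen_input` and `sweep_sizes` are hypothetical names, the line counts passed in are arbitrary, and the pattern is the same 16-byte line (plus newline) used in the reproduction command, written with octal escapes so that it works with a POSIX printf. At 17 bytes per line, 1000 lines is 17,000 bytes and 15000 lines is 255,000 bytes.

```shell
# Hypothetical generator: emit $1 copies of the test line from the report
# (X = \130, \342 = 0xe2, \303\222 = 0xc3 0x92).
gen_input() {
  i=0
  while [ "$i" -lt "$1" ]; do
    printf 'X\342X\303\222X\342XX\342X\303\222X\342X\n'
    i=$((i + 1))
  done
}

# Hypothetical sweep: for each line count given as an argument, run the same
# iconv command as in the report and print how many distinct lines came out.
sweep_sizes() {
  for n in "$@"; do
    distinct=$(gen_input "$n" \
      | iconv -c -f ISO-8859-3 -t UTF-8 \
      | LC_ALL=C sort -u | wc -l | tr -d ' ')
    printf '%s lines in, %s distinct lines out\n' "$n" "$distinct"
  done
}
```

For example, `sweep_sizes 1000 2000 4000 8000 15000` prints one result line per size. Since all input lines are identical, a correct conversion would report 1 distinct line for every size.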
|Tags||codepage, glibc, iconv|