0018317 CentOS-8 glibc public 2021-09-30 17:35
Reporter: soko246
Assigned To:
Status: new
Resolution: open
Product Version: 8.4.2105
Summary: 0018317: iconv silently corrupts data
Description: Using iconv for code page conversion produces corrupted output when the "-c" flag (discard characters that cannot be converted) is used on input in which characters that *can* and *cannot* be converted appear together.
The issue only manifests for rather large inputs (presumably > 32K).
There is no error or warning; the results are just broken.
Steps To Reproduce: Open bash and run:
>export LANG=C
>perl -E 'say "\x58\xe2\x58\xc3\x92\x58\xe2\x58\x58\xe2\x58\xc3\x92\x58\xe2\x58\n" x 15000' | iconv -c -f ISO-8859-3 -t UTF-8 | sort | uniq -c

It creates 15000 lines of mixed "X", ISO-8859-3-convertible \xe2, and ISO-8859-3-unconvertible \xc3\x92, which are fed into iconv for conversion to UTF-8.
I expect \xe2 to be converted, \xc3\x92 to be dropped (because of "-c") and, in any case, all lines to be equal.
Something like this:
>15000 XâX�XâXXâX�XâX
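
To see what each byte of the test line contributes to the expected output, the bytes can be converted one at a time. This is only a sketch: show_byte is a hypothetical helper (not part of iconv), and it takes the byte as a three-digit octal escape so that it works with a plain POSIX printf.

```shell
# Hypothetical helper: convert a single byte (given as three octal digits)
# from ISO-8859-3 to UTF-8 with -c and print the resulting bytes as hex.
# An empty line means the byte was discarded by -c.
show_byte() {
    printf "\\$1" | iconv -c -f ISO-8859-3 -t UTF-8 2>/dev/null | od -An -tx1 | tr -d ' \n'
    echo
}

show_byte 130   # \x58 ("X")
show_byte 342   # \xe2
show_byte 303   # \xc3
show_byte 222   # \x92
```

Single-byte inputs are far below the size at which the corruption appears, so this only shows the intended per-byte mapping, not the bug itself.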

However I get *a mix* of broken lines.
I.e. the actual output is:
> 1
> 2 XXâX�XâX
> 2 XâX�XXâX
> 2 XâX�XâX
> 1 XâX�XâXX
> 2 XâX�XâXXâX�X�XâXXâX�XâX
> 14917 XâX�XâXXâX�XâX

As can be seen, many lines just disappear (14917+2+1+2+2+2+1 = 14927, not 15000).
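
The shortfall can be checked with shell arithmetic, using the counts exactly as listed in the uniq -c output above. The single blank entry is probably the extra newline that perl's say appends after the repeated string, so it is counted here as listed.

```shell
# Counts copied from the uniq -c output in this report, top to bottom.
total=$((1 + 2 + 2 + 2 + 1 + 2 + 14917))
# Difference from the 15000 lines that were generated.
missing=$((15000 - total))
echo "output lines: $total, missing: $missing"
```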
Additional Information: The specific input does not matter, as long as it has a mix of convertible and non-convertible characters.
Reducing the number of input lines to a smaller number (e.g. 1000) makes everything work as expected:
>1000 XâX�XâXXâX�XâX

I tried this for ISO-8859-3 and ISO-8859-8 (same input) with similar (wrong) results.
Results are broken in the latest CentOS 8.4 and RHEL 8.4, as well as in CentOS 6.10.

Using piconv (Perl variant of iconv) instead of iconv produces correct results.
Tags: codepage, glibc, iconv


