View Issue Details

ID: 0014192
Project: CentOS-7
Category: util-linux
View Status: public
Last Update: 2018-05-04 07:55
Reporter: kick54
Priority: normal
Severity: minor
Reproducibility: always
Status: closed
Resolution: no change required
Platform: HP ELitebook core i5
OS: Centos
OS Version: 7.4.1708 (Core)
Summary: 0014192: Problem with /bin/sort

Description

Hi,

sort behaves strangely with non-printable characters.

sort ignores the first (non-printable) character of each line, even though I am not using the -i option.

Thanks for your help.
Steps To Reproduce

$ echo -e "\001 x\n\002 y\n\003 b">f
$ od -c f
0000000 001 x \n 002 y \n 003 b \n
0000014
$ sort f
 b
 x
 y
 
Tags: No tags attached.

Activities

N3WWN

2018-05-03 19:57

reporter   ~0031713

Believe it or not, this is not a bug. :)

What you're running into is due to Unicode character encoding and collation.

We can see this by turning on sort debugging:

~~~
$ sort --debug f
sort: using ‘en_US.UTF-8’ sorting rules
 b
__
 x
__
 y
__
~~~

This tells us a few things.

First, it tells us that the en_US.UTF-8 sorting rules are in effect (more on that in a bit).

Second, the underscores mark the part of each line that sort uses as its key. No -k option is given, so the key is the whole line; the non-printable character takes up no visible width, which is why only two underscores are drawn. The output is in alphabetical order of the letters, which shows that the non-printable character at the start of each line did not influence the comparison.

Getting back to the sorting rules, "UTF-8 is a variable width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes." ( Wikipedia https://en.wikipedia.org/wiki/UTF-8 )

Unicode uses a "Collation Element Table" to determine sort order. The Default Unicode Collation Element Table (DUCET) is available at ftp://unicode.org/Public/UCA/latest/allkeys.txt .

We can examine this file to find out the collation values for each of the bytes in our file:

~~~
$ egrep '# LATIN SMALL LETTER [BXY]$|^000[1-3]' allkeys.txt
0001 ; [.0000.0000.0000.0000] # [0001] START OF HEADING (in 6429)
0002 ; [.0000.0000.0000.0000] # [0002] START OF TEXT (in 6429)
0003 ; [.0000.0000.0000.0000] # [0003] END OF TEXT (in 6429)
0062 ; [.15EA.0020.0002.0062] # LATIN SMALL LETTER B
0078 ; [.1860.0020.0002.0078] # LATIN SMALL LETTER X
0079 ; [.1865.0020.0002.0079] # LATIN SMALL LETTER Y
~~~

The collation weight of the first character on each line is "[.0000.0000.0000.0000]", meaning it is ignored for comparison, so sort must use the characters that follow it to determine the sort order.

This explains the output you are seeing.
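
The check on the weights can be sketched in a few lines of shell. This is only an illustration: it works on the six lines quoted above, copied into a variable so that allkeys.txt does not need to be downloaded, and the variable names `sample` and `ignorable` are made up for this example.

```shell
# Sample lines copied from the allkeys.txt excerpt quoted above.
sample='0001 ; [.0000.0000.0000.0000] # [0001] START OF HEADING (in 6429)
0002 ; [.0000.0000.0000.0000] # [0002] START OF TEXT (in 6429)
0003 ; [.0000.0000.0000.0000] # [0003] END OF TEXT (in 6429)
0062 ; [.15EA.0020.0002.0062] # LATIN SMALL LETTER B
0078 ; [.1860.0020.0002.0078] # LATIN SMALL LETTER X
0079 ; [.1865.0020.0002.0079] # LATIN SMALL LETTER Y'

# With awk's default whitespace splitting, field 3 is the bracketed
# collation element. Print the code points whose weights are all zero.
ignorable=$(printf '%s\n' "$sample" |
  awk '$3 == "[.0000.0000.0000.0000]" { print $1 }')
printf '%s\n' "$ignorable"
```

Only the three control characters are listed; the letters b, x and y carry non-zero weights and therefore decide the order.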

Now, to get the output that you are expecting, you can tell sort to use simple byte comparisons by setting LC_ALL=C for the command, which overrides all other localization settings:

~~~
$ LC_ALL=C sort --debug f
sort: using simple byte comparison
 x
__
 y
__
 b
__
~~~

Turning off debug, we can see exactly what you expected to see:

~~~
$ LC_ALL=C sort f
 x
 y
 b
~~~
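
The same result can be reproduced in a script without relying on `echo -e`, which behaves differently across shells. A minimal sketch, assuming `printf`, `mktemp` and `tr` are available; the names `tmp` and `sorted` are illustrative only:

```shell
# Recreate the three-line test file; printf interprets the octal escapes.
tmp=$(mktemp)
printf '\001 x\n\002 y\n\003 b\n' > "$tmp"

# Byte comparison: lines are ordered by their first byte (001, 002, 003).
# tr removes the control bytes afterwards so the result is easy to read.
sorted=$(LC_ALL=C sort "$tmp" | tr -d '\001\002\003')
printf '%s\n' "$sorted"

rm -f "$tmp"
```

Because `LC_ALL=C` is written as a prefix to the command, it applies to that one sort invocation only and leaves the rest of the shell session in its original locale.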

...and if we tell sort to ignore the non-printable characters with -i, the lines are once again ordered by their printable characters, even with byte comparison:

~~~
$ LC_ALL=C sort -i ~/f
 b
 x
 y
~~~
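
Note that -i only changes which bytes take part in the comparison; the lines themselves are written out unchanged. A small sketch to confirm this, assuming `od` is available and using the illustrative names `tmp` and `first_byte`:

```shell
# Recreate the three-line test file from the report.
tmp=$(mktemp)
printf '\001 x\n\002 y\n\003 b\n' > "$tmp"

# With -i the control bytes are skipped during comparison, so the line
# containing "b" sorts first. Dump the first output byte in octal to
# show that the control byte is still present in the output.
first_byte=$(LC_ALL=C sort -i "$tmp" | od -An -to1 -N1 | tr -d ' ')
printf '%s\n' "$first_byte"

rm -f "$tmp"
```

The first byte printed is 003, the control character that precedes "b" in the original file.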

Hopefully this helps clear up what is happening here!

-Rich Alloway (Rogue Wave)
kick54

2018-05-04 07:32

reporter   ~0031717

OK, that's clear.

Thank you very much for spending time on my question!

-Christian Duclou

Issue History

Date Modified     Username  Field                 Change
2017-11-26 17:29  kick54    New Issue
2018-05-03 19:57  N3WWN     Note Added: 0031713
2018-05-04 07:32  kick54    Note Added: 0031717
2018-05-04 07:55  TrevorH   Status                new => closed
2018-05-04 07:55  TrevorH   Resolution            open => no change required