View Issue Details

ID: 0001511
Project: CentOS-4
Category: Other
View Status: public
Last Update: 2006-09-28 20:25
Status: closed
Resolution: open
Product Version: 4.4 - i386
Target Version:
Fixed in Version:
Summary: 0001511: bash shell produces incorrect results for regular expression [A-Z]*
Description: Metacharacters [A-Z]* are not interpreted correctly. This appears universal in the bash shell (the C shell does not have this issue). Irrespective of the state of nocaseglob, lowercase characters match the above expression.

Using ls to illustrate:
ciscorh Fri> ls
bin bldstat2.sql chg Desktop ls mail NB-BILL2 s.sql
ciscorh Fri> shopt nocaseglob
nocaseglob off
ciscorh Fri> ls -d [A-Z]*
bin bldstat2.sql chg Desktop ls mail NB-BILL2 s.sql
ciscorh Fri> ls -d [DN]*
Desktop NB-BILL2
ciscorh Fri> ls -d [D-N]*
Desktop ls mail NB-BILL2

In the above example, only the [DN]* pattern works as documented.
Turning nocaseglob on gives the same results ...

ciscorh Fri> shopt -s nocaseglob
ciscorh Fri> shopt nocaseglob
nocaseglob on
ciscorh Fri> ls -d [A-Z]*
bin bldstat2.sql chg Desktop ls mail NB-BILL2 s.sql

Additional Information: Classified as major; however, this depends on the user. For folks like me who rely on hundreds of scripts written over 20 years, it is a major problem, even though [[:upper:]]* is a viable workaround. For others, this may be considered minor.
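
The [[:upper:]]* workaround mentioned above can be checked in a scratch directory. This is only a minimal sketch, assuming a POSIX shell with mktemp available; the file names mirror the listing in the description, and demo_dir and upper_matches are illustrative names that do not come from the report.

```shell
# Recreate the file names from the report in a throwaway directory.
demo_dir=$(mktemp -d)
(
  cd "$demo_dir" || exit 1
  touch bin bldstat2.sql chg Desktop ls mail NB-BILL2 s.sql
)

# A POSIX character class matches uppercase letters regardless of the
# locale's collating order, unlike the range expression [A-Z].
upper_matches=$(cd "$demo_dir" && echo [[:upper:]]*)
echo "$upper_matches"

# Clean up the scratch directory.
rm -rf "$demo_dir"
```

Because the class names the set of characters directly, the result should not depend on whether LC_ALL is C or a dictionary-ordered locale.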

man bash ...
An additional binary operator, =~, is available, with the same precedence as == and !=. When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3)). The return value is 0 if the string matches the pattern, and 1 otherwise. If the regular expression is syntactically incorrect, the conditional expression's return value is 2. If the shell option nocaseglob is enabled, the match is performed without regard to the case of alphabetic characters. Substrings matched by parenthesized subexpressions within the regular expression are saved in the array variable BASH_REMATCH. The element of BASH_REMATCH with index 0 is the portion of the string matching the entire regular expression. The element of BASH_REMATCH with index n is the portion of the string matching the nth parenthesized subexpression.


nocaseglob - If set, bash matches filenames in a case-insensitive fashion when performing pathname expansion (see Pathname Expansion above).


If the shell option nocaseglob is enabled, the match is performed without regard to the case of alphabetic characters. When a pattern is used for pathname expansion,

OS ...
ciscorh Fri> lsb_release -a
LSB Version: :core-3.0-ia32:core-3.0-noarch:graphics-3.0-ia32:graphics-3.0-noarch
Distributor ID: CentOS
Description: CentOS release 4.4 (Final)
Release: 4.4
Codename: Final
ciscorh Fri> echo $BASH_VERSION

Tags: No tags attached.




2006-09-17 00:39

reporter   ~0003950

Additional information is available in the CentOS General Support forum under the topic "bash shell problem with regular expressions".


2006-09-17 11:48

reporter   ~0003954

Which can be found here


2006-09-28 16:28

reporter   ~0004050

What character set is used in your terminal/shell? It is bad to rely on character sequences like [A-Z], because characters may be ordered differently in non ISO-8859-* sets. Does the problem persist after running

export LC_ALL=C

or even:

export LC_ALL=en_US.ISO8859-1
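
Which locale governs collation, and therefore range expressions such as [A-Z], follows the POSIX precedence: LC_ALL first, then LC_COLLATE, then LANG. The sketch below only reports which value would win; effective_collate is a hypothetical helper, and the fallback to C when nothing is set is an assumption (POSIX leaves that default to the implementation).

```shell
# Print the locale name that applies to collation, following the
# POSIX precedence LC_ALL > LC_COLLATE > LANG.
# Falling back to "C" when all three are unset or empty is an assumption.
effective_collate() {
  echo "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

echo "collation locale: $(effective_collate)"
```

If this prints something other than C or POSIX, range expressions in glob patterns may follow dictionary order rather than ASCII order.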



2006-09-28 17:20

reporter   ~0004053

The character set used is ISO-8859-1. Since the problem is specific to the bash shell, I don't see how the character set could impact the problem.

ciscorh Thu> export LC_ALL=en_US.ISO8859-1
ciscorh Thu> ls
bin bldstat2.sql chg Desktop ls mail NB-BILL2 s.sql

To further illustrate:

ciscorh Thu> bash
[bill@ciscorh ~]$ ls -d [A-Z]*
bin bldstat2.sql chg Desktop ls mail NB-BILL2 s.sql
[bill@ciscorh ~]$ exit
ciscorh Thu> csh
[bill@ciscorh ~]$ ls -d [A-Z]*
Desktop NB-BILL2
[bill@ciscorh ~]$ exit
ciscorh Thu> ksh
$ ls -d [A-Z]*
Desktop NB-BILL2


2006-09-28 17:52

reporter   ~0004054

Last edited: 2006-09-28 17:52

The character set certainly has an impact on this. This is on one of my CentOS 4.4 machines:

[daniel@creampuff test]$ ls -l
total 20
drwxrwxr-x 2 daniel daniel 4096 Sep 28 19:44 Apple
-rw-rw-r-- 1 daniel daniel 0 Sep 28 19:44 orange
drwxrwxr-x 2 daniel daniel 4096 Sep 28 19:44 pear
[daniel@creampuff test]$ ls -d [A-Z]*
Apple orange pear
[daniel@creampuff test]$ export LC_ALL=C
[daniel@creampuff test]$ ls -d [A-Z]*

Sorry for suggesting ISO 8859-1. 'C' should do the trick. The reason for this happening is documented in various places, including the "Bash beginner's guide":

"Within a bracket expression, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, inclusive, using the locale's collating sequence and character set. For example, in the default C locale, "[a-d]" is equivalent to "[abcd]". Many locales sort characters in dictionary order, and in these locales "[a-d]" is typically not equivalent to "[abcd]"; it might be equivalent to "[aBbCcDd]", for example. To obtain the traditional interpretation of bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value "C""


That explains pretty much what is going on. So, it's not a bug in bash, just locale-specific character set ordering.
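
The locale dependence described above can be observed without touching the login environment by setting LC_ALL only inside a subshell. This is a minimal sketch, assuming a POSIX shell with mktemp; the file names follow the listing in this note, and demo_dir and c_matches are illustrative names. Note that the setting has to be in effect in the shell that expands the pattern: a one-off prefix such as LC_ALL=C ls -d [A-Z]* does not help, because the current shell expands the glob before ls is started.

```shell
# Build a scratch directory with the three names used in this note.
demo_dir=$(mktemp -d)
touch "$demo_dir/Apple" "$demo_dir/orange" "$demo_dir/pear"

# The command substitution runs in a subshell, so neither the cd nor
# the locale change leaks into the calling shell.
c_matches=$(
  cd "$demo_dir" || exit 1
  LC_ALL=C
  export LC_ALL
  echo [A-Z]*
)
echo "$c_matches"

# Clean up the scratch directory.
rm -rf "$demo_dir"
```

Scoping the change this way lets an old script keep its [A-Z] patterns while the rest of the session stays in the user's preferred locale.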



2006-09-28 19:51

reporter   ~0004056

Thank you!

I should have tried LC_ALL=C first. It does indeed work.

However, I respectfully submit that if this is not a bug, then the man page for bash is deficient, since the nocaseglob documentation makes no reference to this requirement. In addition, it seems to me that the CentOS distro should set this variable by default.


2006-09-28 19:57

administrator   ~0004057

No. I require de_DE.UTF-8 in my shells - other people in other countries might require other settings. The world doesn't revolve around ASCII ...

So it is *not* a good idea to set LC_ALL to C.


2006-09-28 20:10

reporter   ~0004058

Last edited: 2006-09-28 20:11

I agree with range. The Single Unix Specification version 3 advises against using range expressions, and predefines commonly used ranges with a symbolic name. In other words, it is mostly a deprecated construct and should not be used.

Besides that, UTF-8 is used for many applications these days, and is used by many people who don't have English as their native language. So setting the locales back to C would be a step back.
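
The symbolic names referred to above are the POSIX character classes, such as [:upper:], [:lower:] and [:digit:]. A small sketch of using them in case patterns instead of ranges, assuming a POSIX shell; classify_name is a hypothetical helper, and the sample names are taken from the listing in the description.

```shell
# Classify a name by its first character using POSIX character
# classes, which avoid the collation-dependent ranges [A-Z] and [a-z].
classify_name() {
  case $1 in
    [[:upper:]]*) echo upper ;;
    [[:lower:]]*) echo lower ;;
    [[:digit:]]*) echo digit ;;
    *)            echo other ;;
  esac
}

for name in Desktop mail NB-BILL2 s.sql; do
  printf '%s: %s\n' "$name" "$(classify_name "$name")"
done
```

The same bracket syntax works in pathname expansion, so a script can be migrated by replacing [A-Z] with [[:upper:]] and leaving the locale alone.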



2006-09-28 20:25

administrator   ~0004059

Okay, closing this "bug", as it is not an issue with the bash which comes with CentOS.

Issue History

Date Modified Username Field Change
2006-09-15 18:41 bgravatt New Issue
2006-09-15 18:41 bgravatt Status new => assigned
2006-09-17 00:39 bgravatt Note Added: 0003950
2006-09-17 11:48 BillMaltby Note Added: 0003954
2006-09-28 16:28 danieldk Note Added: 0004050
2006-09-28 17:20 bgravatt Note Added: 0004053
2006-09-28 17:52 danieldk Note Added: 0004054
2006-09-28 17:52 danieldk Note Edited: 0004054
2006-09-28 19:51 bgravatt Note Added: 0004056
2006-09-28 19:57 range Note Added: 0004057
2006-09-28 20:10 danieldk Note Added: 0004058
2006-09-28 20:11 danieldk Note Edited: 0004058
2006-09-28 20:25 range Status assigned => closed
2006-09-28 20:25 range Note Added: 0004059