regex/replace white list

**RobG** · Feb 17 '06, 01:15 AM

Re: regex/replace white list

jgabbai@gmail.com wrote:
> Hi,
>
> What is the best way to white list a set of allowable characters using
> regex or replace? I understand it is safer to whitelist than to
> blacklist, but am not sure how to go about it.

Whether to use a white list (i.e. list of allowed characters) or a black
list (list of not allowed characters) is probably best decided by which
one gives the smaller list. I'm not sure 'safety' is an issue.

As far as a regular expression is concerned, the difference between the
two is whether to use the NOT (!) operator or not (or use an else
statement).

To build the white/black list, use a string of characters and the
RegExp() function as a constructor, e.g. if you want to disallow the
letter 'a' in a string, then:

var re = new RegExp('a');

will create a regular expression that can be used to match the letter
'a' anywhere, e.g.:

if ( re.test(someString) )
{
// someString contains the letter 'a'
} else {
// someString doesn't contain the letter 'a'
}

or:

if ( ! re.test(someString) )
{
// someString doesn't contain the letter 'a'
}

To make the regular expression case-insensitive, add the 'i' flag:

var re = new RegExp('a','i');

To match any word character or the '$' character:

var re = new RegExp('[\\w$]');

To match any non-word character (not part of: a-z, A-Z, 0-9, _):

var re = new RegExp('\\W');

You can build the expression and flags as string variables and use those:

var reString = '\\W'; // Expression string
var flString = 'g'; // Flag string
var re = new RegExp(reString, flString);

and so on... Search the archives for lots of examples.
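
For the white list case in the original question, here is a minimal sketch. It assumes the allowed set is word characters plus the space; that set and the names `allowed`, `isClean` and `strip` are illustrative only and do not come from the thread. A negated character class matches anything outside the list, which can then be tested for or replaced:

```javascript
// Assumed white list: word characters (a-z, A-Z, 0-9, _) and the space.
const allowed = '\\w ';

// [^...] matches any single character that is NOT in the list.
const notAllowed = new RegExp('[^' + allowed + ']');
const notAllowedAll = new RegExp('[^' + allowed + ']', 'g');

// True when the string contains only white-listed characters.
function isClean(str) {
  return !notAllowed.test(str);
}

// Remove every character that is not on the white list.
function strip(str) {
  return str.replace(notAllowedAll, '');
}

console.log(isClean('Cost 6'));    // true
console.log(isClean('Cost: $6'));  // false
console.log(strip('Cost: $6'));    // "Cost 6"
```

Whether to reject the input or silently strip it is a design decision for the particular form; the sketch only shows that both start from the same character class.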

--
Rob

**Thomas 'PointedEars' Lahn** · Feb 17 '06, 02:25 PM

Re: regex/replace white list

RobG wrote:
> To build the white/black list, use a string of characters and the
> RegExp() function as a constructor, e.g. if you want to disallow the
> letter 'a' in a string, then:
>
> var re = new RegExp('a');
>
> will create a regular expression that can be used to match the letter
> 'a' anywhere, [...]

There is not much point in using the RegExp() constructor instead
of a Regular Expression literal when the expression is invariant, though.
As was discussed here recently, efficiency and compatibility are seldom
an issue:

As for efficiency, the RegExp object created by a RegExp literal is created
before execution, and the literal is then merely a reference to that
object. The RegExp object is not recreated by repeated use of the same
literal (say, in a loop). (Which must be considered regarding efficiency,
though, since this will create a new RegExp object always if the expression
differs, unconditionally. Even if the object is used only when a certain
condition applies.)

As for compatibility, even though RegExp literals have not been specified
before ECMAScript Edition 3 (issued 1999, seven years ago already, though),
they are supported since JavaScript 1.2 (Netscape 4.0, June 1997) except
for the `m' modifier. They are supported including the `m' modifier since
JavaScript 1.5 (Mozilla/5.0 rv:0.6, November 2000) and JScript 3.0
(Internet Explorer 4.0, and Internet Information Server 4.0, October 1997).
(The problems that remain compared to ECMAScript Edition 3 are non-capturing
parentheses and non-greedy expressions that are not universally supported,
but you have to deal with those problems with the RegExp() constructor as
well.)

However, using the RegExp constructor removes and introduces a maintenance
problem. It removes the problem that Regular Expressions cannot span lines
because string concatenation serves the purpose. It introduces the problem
that one has to escape the expression twice: one time to avoid escape
sequences in the string literal, and again to have RegExp special
characters parsed as expression atoms instead. (This is often very
confusing to people who are fairly new to the language.)

var re = /a/;

and the like certainly suffices here.
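
The double escaping described above can be seen in a small sketch. The decimal-number pattern and the variable names are chosen only for illustration:

```javascript
// Match a decimal number such as "3.14": digits, a literal dot, digits.
const literal = /\d+\.\d+/;

// The same expression as a string: each backslash must itself be escaped,
// otherwise "\d" and "\." are consumed by the string literal.
const constructed = new RegExp('\\d+\\.\\d+');

// With single backslashes the string is just "d+.d+", a different pattern.
const mistaken = new RegExp('\d+\.\d+');

console.log(literal.source === constructed.source); // true
console.log(constructed.test('3.14'));              // true
console.log(mistaken.test('3.14'));                 // false
console.log(mistaken.test('ddxdd'));                // true
```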

As a final note, I want to add that if special features of Regular
Expressions compared to strings are not used, it is probably more
efficient not to use Regular Expressions at all. Instead of writing

if (re.test(someString))

using the RegExp() constructor or the above RegExp object initializer,
it is probably more efficient to write

if (someString.indexOf("a") > -1)

instead.

PointedEars

**RobG** · Feb 20 '06, 06:05 AM

Re: regex/replace white list

Thomas 'PointedEars' Lahn wrote:
> RobG wrote:
>
>>To build the white/black list, use a string of characters and the
>>RegExp() function as a constructor, e.g. if you want to disallow the
>>letter 'a' in a string, then:
>>
>> var re = new RegExp('a');
>>
>>will create a regular expression that can be used to match the letter
>>'a' anywhere, [...]
>
>
> While there is not much point in using the RegExp() constructor instead
> of a Regular Expression literal when the expression is invariant.

My understanding of the request is that the string *is* variant. The OP
wishes to build a list of characters to allow/disallow, I presumed it
would not be hard-coded - though it might be built that way at the
server where the value is extracted from a database and the appropriate
value hard-coded into the script.

But I supposed that the value would be written to some variable, which is
then accessed by the script, e.g.

var blackList = '$%#';

and then later:

var re = new RegExp('[' + blackList + ']');

> of a Regular Expression literal when the expression is invariant. As was
> discussed here recently, efficiency and compatibility are seldom an issue:
>
> As for efficiency, the RegExp object created by a RegExp literal is created
> before execution, and the literal is then merely a reference to that
> object. The RegExp object is not recreated by repeated use of the same
> literal (say, in a loop). (Which must be considered regarding efficiency,
> though, since this will create a new RegExp object always if the expression
> differs, unconditionally. Even if the object is used only when a certain
> condition applies.)

Quite true, I was addressing efficiency from the point of view of the
length of the expression, e.g. to allow only letters, digits and the
underscore, \w will do the trick. To disallow only '@#$' then - [@#$] -
is much shorter than a list of everything else.

The difference in efficiency between using RegExp as a constructor and
using a literal in the above scenario is likely irrelevant (though I
understand your point and in general much prefer to use literals).

[...]
> However, using the RegExp constructor removes and introduces a maintenance
> problem. It removes the problem that Regular Expressions cannot span lines
> because string concatenation serves the purpose. It introduces the problem
> that one has to escape the expression twice: one time to avoid escape
> sequences in the string literal, and again to have RegExp special
> characters parsed as expression atoms instead.

Escaping characters is always an issue, especially if multi-line input
is accepted. Should new lines & line feeds be allowed? The solution is
for the OP to learn about matching characters and apply that to their
particular circumstance.

[...]
>
> var re = /a/;
>
> and the like certainly suffices here.

Probably a result of my trivial example - a better example is below.

> As I final note, I want to add that if special features of Regular
> Expressions compared to strings are not used, it is probably more
> efficient not to use Regular Expressions at all. Instead of writing
>
> if (re.test(someString))
>
> using the RegExp() constructor or the above RegExp object initializer,
> it is probably more efficient to write
>
> if (someString.indexOf("a") > -1)
>

If the need was a test for a specific character, then that would be
fine. Maybe you could use it with a loop to go through each character
in the black list, but how many characters/loops would it take before a
regular expression was faster?
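
For comparison, the loop alternative could look roughly like this. It is only a sketch: the function name `containsAny` is made up here, and no claim is made about which approach is faster.

```javascript
// Returns true if str contains any character from the blackList string.
function containsAny(str, blackList) {
  for (let i = 0; i < blackList.length; i++) {
    if (str.indexOf(blackList.charAt(i)) > -1) {
      return true;
    }
  }
  return false;
}

console.log(containsAny('Cost: $6', '$%#')); // true
console.log(containsAny('Cost: 6', '$%#'));  // false
```

One side effect of this form is that the black list characters need no regular expression escaping, since they are only ever compared as plain strings.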

The following example may be better:

<script type="text/javascript">

function checkList(blID, strID)
{
var blackList = document.getElementById(blID).value;
var inString = document.getElementById(strID).value;
var re = new RegExp('[' + blackList + ']');
document.getElementById('xx').innerHTML = re.test(inString);
}
</script>

<label for="blackList">Blacklist characters:<input
type="text" id="blackList" value="\^\]$#@"></label><br>

<label for="inputText">String to check:<input
type="text" id="inputText" value="Cost: $6"></label>

<input type="button" value="Check input with blacklist"
onclick="checkList('blackList','inputText');">

<div>Result: <span id="xx" style="font-weight: bold;">
<i>no check done yet...</i></span></div>

If new lines, line feeds, etc. need to be tested too, use a textarea
instead of a text input for the input string. Variations on how
browsers represent new lines may need to be accommodated too.
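
One way to accommodate those variations is to normalize line endings before testing. This is a sketch under the assumption that the black list contains the newline character; `normalizeNewlines` is a made-up helper name.

```javascript
// Convert Windows (\r\n) and old Mac (\r) line endings to a plain \n.
function normalizeNewlines(str) {
  return str.replace(/\r\n?/g, '\n');
}

// Assumed black list that includes the newline character.
const re = /[\n$#@]/;

console.log(re.test('line one\rline two'));                     // false
console.log(re.test(normalizeNewlines('line one\rline two')));  // true
console.log(re.test(normalizeNewlines('single line')));         // false
```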

--
Rob

**Thomas 'PointedEars' Lahn** · Feb 20 '06, 12:05 PM

Re: regex/replace white list

RobG wrote:
> Thomas 'PointedEars' Lahn wrote:
>> However, using the RegExp constructor removes and introduces a
>> maintenance problem. It removes the problem that Regular Expressions
>> cannot span lines because string concatenation serves the purpose. It
>> introduces the problem that one has to escape the expression twice: one
>> time to avoid escape sequences in the string literal, and again to have
>> RegExp special characters parsed as expression atoms instead.
>
> Escaping characters is always an issue, especially if multi-line input
> is accepted. Should new lines & line feeds be allowed?

You misunderstood. This was not about matching newline in the input.

> The solution is for the OP to learn about matching characters and apply
> that to their particular circumstance.

My point was that

var rx = /very_long_Regular_Expression.a.b.c.d.e.f.g.h.i.j.k.l.m.n.o.p.
r.s.t.u.v.w.x.y.z.\..#.#.4.2.1.3.3.7./

is not possible (consider the above a _hard_ line break to avoid crossing
the 80-columns border), but

var rx = new RegExp(
"very_long_Regular_Expression.a.b.c.d.e.f.g.h.i.j.k.l.m.n.o.p."
+ "r.s.t.u.v.w.x.y.z.\\..#.#.4.2.1.3.3.7.");

(and the like) is. The latter introduces the maintenance problem that the
literal "." must be escaped twice, but it removes the maintenance problem
that literals are not allowed to span lines (in the source code).
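
A shorter sketch of the same trade-off, using a made-up file name pattern (the pattern and the two extensions are assumptions for illustration). Each fragment sits on its own source line with a comment, at the price of writing the literal dot as `\\.`:

```javascript
// Equivalent literal, on one line: /^[\w-]+\.(js|html)$/
const rx = new RegExp(
  '^[\\w-]+' +    // base name: word characters or hyphens
  '\\.' +         // a literal dot, escaped once for the string, once for the RegExp
  '(js|html)$'    // one of two assumed extensions
);

console.log(rx.test('dhtml.js'));  // true
console.log(rx.test('dhtmlxjs'));  // false
```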

>> As I final note, I want to add that if special features of Regular
>> Expressions compared to strings are not used, it is probably more
>> efficient not to use Regular Expressions at all. Instead of writing
>>
>> if (re.test(someString))
>>
>> using the RegExp() constructor or the above RegExp object initializer,
>> it is probably more efficient to write
>>
>> if (someString.indexOf("a") > -1)
>
> If the need was a test for a specific character, then that would be
> fine. Maybe you could use it with a loop to go through each character
> in the black list, but how many characters/loops would it take before a
> regular expression was faster?

I do not know. This was a general note.

> The following example may be better:

Maybe not :)

> <script type="text/javascript">
>
> function checkList(blID, strID)
> {
> var blackList = document.getElementById(blID).value;
> var inString = document.getElementById(strID).value;

A `form' element would have avoided the inefficient and not downwards
compatible referencing.

function checkList(f, blID, strID)
{
var es;
if (blID && strID
&& f && (es = f.elements)
&& es[blID] && es[strID])
{
var blackList = es[blID].value;
var inString = es[strID].value;

// ...
}
else
{
window.alert("foobar!");
}

return false;
}

<form action="..."
onsubmit="return checkList(this, 'blackList', 'inputText');">
...
<input type="submit" value="Check input with blacklist">
</form>

> var re = new RegExp('[' + blackList + ']');

What about the escaping part? You do not want the user to handle that,
do you?
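
One possible way to take the escaping off the user's hands is sketched below. The helper names `escapeForClass` and `makeBlackListRegExp` are made up for this example; the idea is to escape the characters that are special inside a character class before building the expression.

```javascript
// Escape the characters that are special inside [...]: \ ] ^ -
function escapeForClass(chars) {
  return chars.replace(/[\\\]\^-]/g, '\\$&');
}

// Build a RegExp that matches any single character from the black list.
function makeBlackListRegExp(blackList) {
  return new RegExp('[' + escapeForClass(blackList) + ']');
}

// The user types the characters as they are, without backslashes.
const re = makeBlackListRegExp('^]$#@');

console.log(re.test('Cost: $6'));  // true
console.log(re.test('Cost: 6'));   // false
console.log(re.test('a]b'));       // true
```

With this, the default value of the black list field could be the plain `^]$#@` rather than a pre-escaped string.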

> document.getElementById('xx').innerHTML = re.test(inString);

Mixing standards compliant and proprietary DOM features unnecessarily.

es["xx"].style.fontStyle = "normal"; // I prefer setStyleProperty()[1]
es["xx"].value = re.test(inString);

<form ...>
...
<div>Result: <input id="xx"
value="no check done yet..."
style="border:0; font-weight:bold; font-style:italic"></div>
</form>

> [...]

PointedEars
___________
[1] <URL:http://pointedears.de/scripts/dhtml.js>
