regex/replace white list

  • jgabbai@gmail.com

    regex/replace white list

    Hi,

    What is the best way to white list a set of allowable characters using
    regex or replace? I understand it is safer to whitelist than to
    blacklist, but am not sure how to go about it.

    Many thanks!

  • RobG

    #2
    Re: regex/replace white list

    jgabbai@gmail.com wrote:
    > Hi,
    >
    > What is the best way to white list a set of allowable characters using
    > regex or replace? I understand it is safer to whitelist than to
    > blacklist, but am not sure how to go about it.

    Whether to use a white list (i.e. list of allowed characters) or a black
    list (list of not allowed characters) is probably best decided by which
    one gives the smaller list. I'm not sure 'safety' is an issue.

    As far as a regular expression is concerned, the difference between the
    two is whether you negate the result of the test with the NOT (!)
    operator (or use an else branch).

    To build the white/black list, use a string of characters and the
    RegExp() function as a constructor, e.g. if you want to disallow the
    letter 'a' in a string, then:

    var re = new RegExp('a');

    will create a regular expression that can be used to match the letter
    'a' anywhere, e.g.:

    if ( re.test(someString) )
    {
      // someString contains the letter 'a'
    } else {
      // someString doesn't contain the letter 'a'
    }

    or:

    if ( ! re.test(someString) )
    {
      // someString doesn't contain the letter 'a'
    }

    To make the regular expression case-insensitive, add the 'i' flag:

    var re = new RegExp('a', 'i');


    To match any word character or the '$' character:

    var re = new RegExp('[\\w$]');


    To match any non-word character (not part of: a-z, A-Z, 0-9, _):

    var re = new RegExp('\\W');


    You can build the expression and flags as string variables and use those:

    var reString = '\\W'; // Expression string
    var flString = 'g'; // Flag string
    var re = new RegExp(reString, flString);


    and so on... Search the archives for lots of examples.
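
    For the original white list question, one common approach is a negated
    character class: anything that is not in the allowed set is either
    rejected or stripped. The following is only a minimal sketch; the names
    isAllowed and stripDisallowed and the allowed set (letters, digits,
    underscore and space) are assumptions for illustration, not something
    given in this thread:

    ```javascript
    // Assumed allowed set: A-Z, a-z, 0-9, underscore (all via \w) and space.
    // No 'g' flag here, so test() keeps no state between calls.
    var disallowed = /[^\w ]/;

    // true if every character of str is in the allowed set
    function isAllowed(str) {
      return !disallowed.test(str);
    }

    // remove every character that is not in the allowed set
    function stripDisallowed(str) {
      return str.replace(/[^\w ]/g, '');
    }
    ```

    Whether to reject the input or silently strip it depends on the
    application; stripping can change the meaning of what the user typed.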



    --
    Rob


    • Thomas 'PointedEars' Lahn

      #3
      Re: regex/replace white list

      RobG wrote:
      > To build the white/black list, use a string of characters and the
      > RegExp() function as a constructor, e.g. if you want to disallow the
      > letter 'a' in a string, then:
      >
      > var re = new RegExp('a');
      >
      > will create a regular expression that can be used to match the letter
      > 'a' anywhere, [...]

      There is not much point in using the RegExp() constructor instead
      of a Regular Expression literal when the expression is invariant. As was
      discussed here recently, efficiency and compatibility are seldom an issue:

      As for efficiency, the RegExp object created by a RegExp literal is created
      before execution, and the literal is then merely a reference to that
      object. The RegExp object is not recreated by repeated use of the same
      literal (say, in a loop). (Which must be considered regarding efficiency,
      though, since this will create a new RegExp object always if the expression
      differs, unconditionally. Even if the object is used only when a certain
      condition applies.)

      As for compatibility, even though RegExp literals have not been specified
      before ECMAScript Edition 3 (issued 1999, seven years ago already, though),
      they are supported since JavaScript 1.2 (Netscape 4.0, June 1997) except
      for the `m' modifier. They are supported including the `m' modifier since
      JavaScript 1.5 (Mozilla/5.0 rv:0.6, November 2000) and JScript 3.0
      (Internet Explorer 4.0, and Internet Information Server 4.0, October 1997).
      (The problems that remain compared to ECMAScript Edition 3 are non-capturing
      parentheses and non-greedy expressions that are not universally supported,
      but you have to deal with those problems with the RegExp() constructor as
      well.)

      However, using the RegExp constructor removes and introduces a maintenance
      problem. It removes the problem that Regular Expressions cannot span lines
      because string concatenation serves the purpose. It introduces the problem
      that one has to escape the expression twice: one time to avoid escape
      sequences in the string literal, and again to have RegExp special
      characters parsed as expression atoms instead. (This is often very
      confusing to people who are fairly new to the language.)

      var re = /a/;

      and the like certainly suffices here.
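
      To illustrate the double escaping mentioned above, here is a small
      sketch. The pattern (a decimal number such as 1.5) and the variable
      names are only assumptions chosen for the example:

      ```javascript
      // Literal: each RegExp escape is written once.
      var literal = /\d+\.\d+/;

      // Constructor: every backslash must be doubled, because the string
      // literal consumes one level of escaping before RegExp sees the text.
      var constructed = new RegExp('\\d+\\.\\d+');

      // Forgetting to double them: '\d' and '\.' are not string escapes,
      // so the string is just 'd+.d+' and the pattern means something else.
      var underEscaped = new RegExp('\d+\.\d+');
      ```

      Comparing literal.source with constructed.source is a quick way to
      check that a constructed expression is the one that was intended.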

      As a final note, I want to add that if special features of Regular
      Expressions compared to strings are not used, it is probably more
      efficient not to use Regular Expressions at all. Instead of writing

      if (re.test(someString))

      using the RegExp() constructor or the above RegExp object initializer,
      it is probably more efficient to write

      if (someString.indexOf("a") > -1)

      instead.


      PointedEars


      • RobG

        #4
        Re: regex/replace white list

        Thomas 'PointedEars' Lahn wrote:
        > RobG wrote:
        >
        >> To build the white/black list, use a string of characters and the
        >> RegExp() function as a constructor, e.g. if you want to disallow the
        >> letter 'a' in a string, then:
        >>
        >> var re = new RegExp('a');
        >>
        >> will create a regular expression that can be used to match the letter
        >> 'a' anywhere, [...]
        >
        > There is not much point in using the RegExp() constructor instead
        > of a Regular Expression literal when the expression is invariant.

        My understanding of the request is that the string *is* variant. The OP
        wishes to build a list of characters to allow/disallow; I presumed it
        would not be hard-coded, though it might be built that way at the
        server, where the value is extracted from a database and the appropriate
        value hard-coded into the script.

        But I supposed that the value would be written to some variable, which is
        then accessed by the script, e.g.

        var blackList = '$%#';

        and then later:

        var re = new RegExp('[' + blackList + ']');

        > of a Regular Expression literal when the expression is invariant. As was
        > discussed here recently, efficiency and compatibility are seldom an issue:
        >
        > As for efficiency, the RegExp object created by a RegExp literal is created
        > before execution, and the literal is then merely a reference to that
        > object. The RegExp object is not recreated by repeated use of the same
        > literal (say, in a loop). (Which must be considered regarding efficiency,
        > though, since this will create a new RegExp object always if the expression
        > differs, unconditionally. Even if the object is used only when a certain
        > condition applies.)

        Quite true, I was addressing efficiency from the point of view of the
        length of the expression, e.g. to allow only letters, digits and the
        underscore, \w will do the trick. To disallow only '@#$', [@#$] is much
        shorter than a list of everything else.

        The difference in efficiency between using RegExp as a constructor and
        using a literal in the above scenario is likely irrelevant (though I
        understand your point and in general much prefer to use literals).

        [...]
        > However, using the RegExp constructor removes and introduces a maintenance
        > problem. It removes the problem that Regular Expressions cannot span lines
        > because string concatenation serves the purpose. It introduces the problem
        > that one has to escape the expression twice: one time to avoid escape
        > sequences in the string literal, and again to have RegExp special
        > characters parsed as expression atoms instead.

        Escaping characters is always an issue, especially if multi-line input
        is accepted. Should new lines & line feeds be allowed? The solution is
        for the OP to learn about matching characters and apply that to their
        particular circumstance.


        [...]
        >
        > var re = /a/;
        >
        > and the like certainly suffices here.

        Probably a result of my trivial example - a better example is below.

        > As a final note, I want to add that if special features of Regular
        > Expressions compared to strings are not used, it is probably more
        > efficient not to use Regular Expressions at all. Instead of writing
        >
        > if (re.test(someString))
        >
        > using the RegExp() constructor or the above RegExp object initializer,
        > it is probably more efficient to write
        >
        > if (someString.indexOf("a") > -1)

        If the need was a test for a specific character, then that would be
        fine. Maybe you could use it with a loop to go through each character
        in the black list, but how many characters/loops would it take before a
        regular expression was faster?
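
        For comparison, the loop approach might look something like this
        sketch (the function name containsAny is made up for the example, and
        no claim is made here about where the break-even point with a regular
        expression lies):

        ```javascript
        // Returns true if str contains at least one character from chars.
        // No escaping is needed because indexOf() treats each character
        // literally, unlike a character class built from the same string.
        function containsAny(str, chars) {
          for (var i = 0; i < chars.length; i++) {
            if (str.indexOf(chars.charAt(i)) > -1) {
              return true;
            }
          }
          return false;
        }
        ```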

        The following example may be better:

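        <script type="text/javascript">

        function checkList(blID, strID)
        {
          // Read the blacklist characters and the string to check from the page
          var blackList = document.getElementById(blID).value;
          var inString = document.getElementById(strID).value;
          // The blacklist goes into the character class exactly as typed,
          // so the user has to escape special characters such as ^ and ]
          var re = new RegExp('[' + blackList + ']');
          document.getElementById('xx').innerHTML = re.test(inString);
        }
        </script>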


        <label for="blackList">Blacklist characters:<input
        type="text" id="blackList" value="\^\]$#@"></label><br>

        <label for="inputText">String to check:<input
        type="text" id="inputText" value="Cost: $6"></label>

        <input type="button" value="Check input with blacklist"
        onclick="checkList('blackList','inputText');">

        <div>Result: <span id="xx" style="font-weight: bold;">
        <i>no check done yet...</i></span></div>


        If new lines, line feeds, etc. need to be tested too, use a textarea
        instead of a text input for the input string. Variations on how
        browsers represent new lines may need to be accommodated too.
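
        One possible way to accommodate those variations is to normalise the
        line breaks before testing. This is only a sketch; normalizeNewlines
        is a name invented for the example:

        ```javascript
        // Convert CR LF and lone CR line breaks to a single LF, so that the
        // black/white list only has to deal with '\n'.
        function normalizeNewlines(str) {
          return str.replace(/\r\n?/g, '\n');
        }
        ```

        After that, allowing or disallowing new lines is just a matter of
        whether \n is part of the character class.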



        --
        Rob


        • Thomas 'PointedEars' Lahn

          #5
          Re: regex/replace white list

          RobG wrote:
          > Thomas 'PointedEars' Lahn wrote:
          >> However, using the RegExp constructor removes and introduces a
          >> maintenance problem. It removes the problem that Regular Expressions
          >> cannot span lines because string concatenation serves the purpose. It
          >> introduces the problem that one has to escape the expression twice: one
          >> time to avoid escape sequences in the string literal, and again to have
          >> RegExp special characters parsed as expression atoms instead.
          >
          > Escaping characters is always an issue, especially if multi-line input
          > is accepted. Should new lines & line feeds be allowed?

          You misunderstood. This was not about matching newline in the input.

          > The solution is for the OP to learn about matching characters and apply
          > that to their particular circumstance.

          My point was that

          var rx = /very_long_Regular_Expression.a.b.c.d.e.f.g.h.i.j.k.l.m.n.o.p.
          r.s.t.u.v.w.x.y.z.\..#.#.4.2.1.3.3.7./

          is not possible (consider the above a _hard_ line break to avoid crossing
          the 80-columns border), but

          var rx = new RegExp(
            "very_long_Regular_Expression.a.b.c.d.e.f.g.h.i.j.k.l.m.n.o.p."
            + "r.s.t.u.v.w.x.y.z.\\..#.#.4.2.1.3.3.7.");

          (and the like) is. The latter introduces the maintenance problem that the
          literal "." must be escaped twice, but it removes the maintenance problem
          that literals are not allowed to span lines (in the source code).
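
          A possible middle way, sketched here under the assumption that the
          long expression can be split into meaningful parts, is to write the
          parts as literals and join their `source' properties. The date/time
          pattern and the variable names are invented for the example:

          ```javascript
          // Each part is a literal, so it is escaped only once.
          var datePart = /\d{4}-\d{2}-\d{2}/;
          var timePart = /\d{2}:\d{2}/;

          // The parts are joined as strings; `source' holds the pattern
          // text, so no second level of escaping is needed.
          var stamp = new RegExp(
            '^' + datePart.source
            + 'T' + timePart.source + '$');
          ```

          Flags are not carried over by `source'; they have to be passed to
          RegExp() as the second argument if needed.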

          >> As a final note, I want to add that if special features of Regular
          >> Expressions compared to strings are not used, it is probably more
          >> efficient not to use Regular Expressions at all. Instead of writing
          >>
          >> if (re.test(someString))
          >>
          >> using the RegExp() constructor or the above RegExp object initializer,
          >> it is probably more efficient to write
          >>
          >> if (someString.indexOf("a") > -1)
          >
          > If the need was a test for a specific character, then that would be
          > fine. Maybe you could use it with a loop to go through each character
          > in the black list, but how many characters/loops would it take before a
          > regular expression was faster?

          I do not know. This was a general note.

          > The following example may be better:

          Maybe not :)

          > <script type="text/javascript">
          >
          > function checkList(blID, strID)
          > {
          > var blackList = document.getElementById(blID).value;
          > var inString = document.getElementById(strID).value;

          A `form' element would have avoided the inefficient and not downwards
          compatible referencing.

          function checkList(f, blID, strID)
          {
            var es;
            if (blID && strID
                && f && (es = f.elements)
                && es[blID] && es[strID])
            {
              var blackList = es[blID].value;
              var inString = es[strID].value;

              // ...
            }
            else
            {
              window.alert("foobar!");
            }

            return false;
          }

          <form action="..."
          onsubmit="return checkList(this, 'blackList', 'inputText');">
          ...
          <input type="submit" value="Check input with blacklist">
          </form>

          > var re = new RegExp('[' + blackList + ']');

          What about the escaping part? You do not want the user to handle that,
          do you?
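
          One way to take that burden off the user might be to escape the
          list before it goes into the character class. A sketch only; the
          function names escapeForCharClass and buildBlacklist are made up
          for this example:

          ```javascript
          // Inside a character class only \ ] ^ and - need escaping;
          // each of them is prefixed with a backslash.
          function escapeForCharClass(chars) {
            return chars.replace(/[\\\]\^\-]/g, '\\$&');
          }

          // Build a RegExp that matches any single character from chars,
          // with every character taken literally.
          function buildBlacklist(chars) {
            return new RegExp('[' + escapeForCharClass(chars) + ']');
          }
          ```

          With this the user would type ^]$#@ instead of \^\]$#@ into the
          black list field.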

          > document.getElementById('xx').innerHTML = re.test(inString);

          Mixing standards compliant and proprietary DOM features unnecessarily.

          es["xx"].style.fontStyle = "normal"; // I prefer setStyleProperty()[1]
          es["xx"].value = re.test(inString);

          <form ...>
          ...
          <div>Result: <input id="xx"
          value="no check done yet..."
          style="border:0; font-weight:bold; font-style:italic"></div>
          </form>

          > [...]


          PointedEars
          ___________
          [1] <URL:http://pointedears.de/scripts/dhtml.js>
