Preg_replace whole word only

  • empiresolutions
    New Member
    • Apr 2006
    • 162

    I'm trying to make a naughty-word filter. It removes bad words fine, but words that merely contain a bad word, like "assist" and "asses", get caught in the filter as well. Strangely, though, if the sentence is "My asses to assist me." the clean version reads "My asses to ***ist me." It seems to let through the first occurrence of the bad word inside another word, but then masks the rest. Any ideas? My script is below. Thanks.

    Code:
    function cleanWords($value) {
    
    	/*   strip naughty words   */
    	$bad_word_file = 'standards/badwords.txt';
    	$strtofile = fopen($bad_word_file, "r");
    	$badwords = explode("\n", fread($strtofile, filesize($bad_word_file)));
    	fclose($strtofile);
    	
    	for ($i = 0; $i < count($badwords); $i++) {
    		$wordlist .= str_replace(chr(13),'',$badwords[$i]).'|';
    	}
    	$wordlist = substr($wordlist,0,-1);
    
    	$value = preg_replace("/\b($wordlist)\b/ie", 'preg_replace("/./","*","\\1")', $value);	
    	return $value;
    
    }
  • Atli
    Recognized Expert Expert
    • Nov 2006
    • 5062

    #2
    Hey.

    If you print the $wordlist, does it look right?
    I tested this by just creating the $wordlist manually and it seemed to work fine.

    • empiresolutions
      New Member
      • Apr 2006
      • 162

      #3
      Yes, $wordlist is correct. If it helps, the word list is just over 1,000 words.

      • philipwayne
        New Member
        • Mar 2010
        • 50

        #4
        Use the space character with OR conditions.

        (\s|^)(badword1|badword2)(\s|$)

        That checks for either a space before the word or the start of the string, then for either a space after the word or the end of the line.
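
One thing to watch with this kind of pattern, as a rough sketch: the outer groups consume the surrounding whitespace, so the replacement has to put it back, and two bad words separated by a single space can't both match. Lookarounds are one possible way around that. The word list `foo|bar` below is a placeholder, not anything from the thread.

```php
<?php
$words = 'foo|bar'; // placeholder words standing in for the real list

// Whitespace is part of the match, so $1 and $3 put it back.
// The space after "foo" is consumed, so "bar" has no leading \s left to match.
$consuming = '/(\s|^)(' . $words . ')(\s|$)/i';
$consumed = preg_replace($consuming, '$1***$3', 'foo bar');
echo $consumed, "\n"; // *** bar

// Lookarounds only assert "no non-space character here" without consuming it.
$lookaround = '/(?<!\S)(' . $words . ')(?!\S)/i';
$lookedAround = preg_replace($lookaround, '***', 'foo bar');
echo $lookedAround, "\n"; // *** ***
```

Either version treats punctuation as part of the word, so "foo." would not be caught; whether that matters depends on the text being filtered.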

        • empiresolutions
          New Member
          • Apr 2006
          • 162

          #5
          I ended up finding that the word "a.s.s." was in my list. I think the dots were messing up the expression. For those interested, this is my new code. Thanks for the suggestions that helped get it where it is.

          Code:
          $_SESSION[wordlist] = join("|", array_map('trim', file('standards/badwords.txt')));
          
          function cleanWords($value) {
          
          	global $_SESSION;
          
          	$value = preg_replace("/\b($_SESSION[wordlist])\b/ie", 'str_repeat("*", strlen("\\1")) ', $value);	
          	return $value;
          
          }
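
For anyone hitting the same thing: an unescaped "." in the list is a regex wildcard, so an entry like "a.s.s." can match ordinary words such as "assist". A minimal sketch of one possible fix, assuming every entry is meant literally, is to run each word through preg_quote() before joining. The function names buildBadWordPattern and maskBadWords are made up for this example, and preg_replace_callback() stands in for the /e modifier, which no longer exists as of PHP 7.

```php
<?php
// Hypothetical helper: builds a whole-word pattern from a plain word list.
function buildBadWordPattern(array $words)
{
    $escaped = array();
    foreach ($words as $word) {
        $word = trim($word);
        if ($word !== '') {
            // preg_quote() escapes ".", "*", "?" and friends; "/" is our delimiter.
            $escaped[] = preg_quote($word, '/');
        }
    }
    return '/\b(' . implode('|', $escaped) . ')\b/i';
}

// Hypothetical helper: replaces each match with asterisks of the same length.
function maskBadWords($text, $pattern)
{
    return preg_replace_callback($pattern, function ($match) {
        return str_repeat('*', strlen($match[0]));
    }, $text);
}

$pattern = buildBadWordPattern(array("ass", "a.s.s.", ""));
echo maskBadWords("My ass to assist me.", $pattern), "\n"; // My *** to assist me.
```

Note that \b after an entry ending in punctuation only matches when a word character follows, so entries like "a.s.s." may need different anchoring to be caught at all.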

          • Atli
            Recognized Expert Expert
            • Nov 2006
            • 5062

            #6
            Hey.
            Glad you got it working.

            However, I would consider using a different method. Putting the whole thing into the session is very inefficient. The list is the same for every user and rarely (if ever) changes, right? If so, compiling it for every user and storing it in a separate session for each one does just two things: it eats up resources and clutters the sessions with duplicate data.

            You would be far better off compiling the regular expression into a common file, shared between all users. This is how I would do it. (I wouldn't usually make a ready-to-use code example, but since you already solved this on your own...)
            [code=php]<?php
            define("BADWORDS_RAW_FILE", "/path/to/badwords.txt");
            define("BADWORDS_EXP_FILE", "/path/to/badwords_expression.txt");

            /**
             * Returns a regular expression that can be used to check
             * for "bad" words. Returns an expression in the format:
             * - /\b(list|of|bad|words)\b/i
             */
            function getBadWordsRegexp()
            {
                $regexp = "";

                // Try to fetch an existing expression.
                if(!file_exists(BADWORDS_EXP_FILE) ||
                    filesize(BADWORDS_EXP_FILE) <= 0 ||
                    ($regexp = file_get_contents(BADWORDS_EXP_FILE)) === false)
                {
                    // Make sure the raw word list exists
                    if(!file_exists(BADWORDS_RAW_FILE)) {
                        trigger_error("The bad words file does not exist.", E_USER_ERROR);
                        return false;
                    }

                    // Compile the regular expression
                    $regexp = '/\b(' . join("|", array_map('trim', file(BADWORDS_RAW_FILE))) . ')\b/i';

                    // Try to save it. (No is_writable() check first: it returns
                    // false for a file that does not exist yet, which would
                    // block the very first save.)
                    if(!file_put_contents(BADWORDS_EXP_FILE, $regexp))
                    {
                        trigger_error("Could not save badwords expression. Check file permissions.", E_USER_WARNING);
                    }
                }

                // Return it
                return $regexp;
            }
            ?>[/code]

            Then you could use it like:
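            [code=php]<?php
            function cleanWords($value) {
                $regexp = getBadWordsRegexp();
                // The /e modifier was removed in PHP 7.0, so a callback
                // builds the replacement instead.
                return preg_replace_callback($regexp, function ($match) {
                    return str_repeat("*", strlen($match[1]));
                }, $value);
            }
            ?>[/code]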

            P.S.
            I have a couple of notes on your code, though.
            • You don't need to import $_SESSION into functions using the global keyword. $_SESSION is a "super-global", which makes it available to you wherever you are in your code.
            • All strings need to be quoted. That includes array keys. Which means that:
              [code=php]// This
              $_SESSION[wordlist];

              // Should be
              $_SESSION['wordlist'];[/code]
              If you leave the quotes out, PHP assumes it is a constant. Failing to find a constant, it prints a warning and uses it as a string (which is why it works, even though it is technically an error). As of PHP 8.0 an undefined constant throws an Error instead. For future-compatibility and performance reasons (minor as they may be), it is best to just remember to quote the strings.
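
A small sketch of the quoting rule, using a made-up $data array in place of $_SESSION. One wrinkle worth knowing: inside a double-quoted string, the simple interpolation syntax takes the key without quotes, so an unquoted key inside a pattern string is valid there; it is only outside strings that the quotes are required.

```php
<?php
$data = array('wordlist' => 'foo|bar'); // hypothetical stand-in for $_SESSION

// Outside a string: always quote the key.
$list = $data['wordlist'];

// Inside a double-quoted string, simple syntax: the key is written without quotes.
$simple = "/\b($data[wordlist])\b/i";

// Inside a double-quoted string, curly syntax: the key is quoted as usual.
$curly = "/\b({$data['wordlist']})\b/i";

var_dump($simple === $curly); // bool(true)
```

The curly form is arguably the easier one to read, since the key looks the same inside and outside the string.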

            • empiresolutions
              New Member
              • Apr 2006
              • 162

              #7
              thanks Atli! your suggestions are much appreciated.
