Removing Bad Words

  • Jim Carlock

    Removing Bad Words

    Looking for suggestions on how to handle bad words that might
    get passed in through $_GET['item'] variables.

    My first thoughts included using str_replace() to strip out such
    content, but then one ends up looking for characters that wrap
    around the stripped characters and it ends up as a recursive
    ordeal that fails to identify a poorly constructed $_GET['item']
    variable (when someone hand-types the item into the line and
    makes a simple typing error).

    So the next thoughts involved employing a list of good words
    and if any word in the $_GET['item'] list doesn't fall into the
    list of good words, then an empty string gets returned.

    Any suggestions on how to handle this?

    Thanks,

    Jim Carlock



  • Janwillem Borleffs

    #2
    Re: Removing Bad Words

    Jim Carlock wrote:
    > Any suggestions on how to handle this?

    You will have to implement "fuzzy logic" that will be able to filter not
    only "badword" but also "b a d w o r d", "b@d word", "b*dword", etcetera.

    Although you should be able to catch some of those, the best filter is still
    the human moderator...


    JW
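
    A minimal sketch of the kind of normalization described above, assuming a
    placeholder $badWords list and a hypothetical helper containsBadWord()
    (neither appears in the thread). It maps a few look-alike characters to
    letters and strips separators before a substring check. It does not handle
    wildcard forms such as "b*dword", and the substring check can flag
    innocent text, so it is a starting point rather than a complete filter.

    ```php
    <?php
    // Hypothetical helper: true when the normalized input contains a listed word.
    function containsBadWord(string $input, array $badWords): bool
    {
        // Lowercase, then map a few common look-alike characters to letters.
        $text = str_replace(
            ['@', '0', '1', '$', '3'],
            ['a', 'o', 'i', 's', 'e'],
            strtolower($input)
        );
        // Drop everything that is not a letter, so "b a d w o r d" becomes "badword".
        $text = preg_replace('/[^a-z]/', '', $text);
        foreach ($badWords as $bad) {
            if (strpos($text, $bad) !== false) {
                return true;
            }
        }
        return false;
    }

    $badWords = ['badword']; // placeholder list for illustration

    var_dump(containsBadWord('b a d w o r d', $badWords)); // bool(true)
    var_dump(containsBadWord('b@d word', $badWords));      // bool(true)
    var_dump(containsBadWord('good word', $badWords));     // bool(false)
    ```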



    • Stephen Poley

      #3
      Re: Removing Bad Words

      On Wed, 22 Feb 2006 19:36:41 GMT, "Jim Carlock" <anonymous@127.0.0.1>
      wrote:

      >Looking for suggestions on how to handle bad words that might
      >get passed in through $_GET['item'] variables.
      >
      >My first thoughts included using str_replace() to strip out such
      >content, but then one ends up looking for characters that wrap
      >around the stripped characters and it ends up as a recursive
      >ordeal that fails to identify a poorly constructed $_GET['item']
      >variable (when someone hand-types the item into the line and
      >makes a simple typing error).
      >
      >So the next thoughts involved employing a list of good words
      >and if any word in the $_GET['item'] list doesn't fall into the
      >list of good words, then an empty string gets returned.
      >
      >Any suggestions on how to handle this?

      Automatic removal is just about impossible to do reliably. (People
      living in places such as Sussex and Scunthorpe have complained that
      their addresses get rejected by some sites.) If at all possible use a
      matching routine to detect doubtful entries and place them on one side
      for subsequent manual review.

      --
      Stephen Poley
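
      One way the "detect and set aside" approach might look, as a sketch
      only: the helper classifyEntry(), the $watchList contents and the
      $reviewQueue array are all made up for illustration. Matching on word
      boundaries avoids flagging a watched word that merely appears inside a
      longer word, which is the place-name problem mentioned above.

      ```php
      <?php
      // Hypothetical helper: classify an entry as 'ok' or 'review' instead of editing it.
      function classifyEntry(string $entry, array $watchList): string
      {
          foreach ($watchList as $word) {
              // \b anchors the match to word boundaries, so a watched word embedded
              // inside a longer word does not trigger a match.
              $pattern = '/\b' . preg_quote($word, '/') . '\b/i';
              if (preg_match($pattern, $entry) === 1) {
                  return 'review';
              }
          }
          return 'ok';
      }

      $watchList = ['badword']; // placeholder list for illustration
      $reviewQueue = [];

      foreach (['This has a BadWord in it', 'Nobadwordhere Street'] as $entry) {
          if (classifyEntry($entry, $watchList) === 'review') {
              $reviewQueue[] = $entry; // set aside for a human moderator
          }
      }

      print_r($reviewQueue); // only the first entry is queued
      ```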



      • Michael Austin

        #4
        Re: Removing Bad Words

        Jim Carlock wrote:
        > Looking for suggestions on how to handle bad words that might
        > get passed in through $_GET['item'] variables.
        >
        > My first thoughts included using str_replace() to strip out such
        > content, but then one ends up looking for characters that wrap
        > around the stripped characters and it ends up as a recursive
        > ordeal that fails to identify a poorly constructed $_GET['item']
        > variable (when someone hand-types the item into the line and
        > makes a simple typing error).
        >
        > So the next thoughts involved employing a list of good words
        > and if any word in the $_GET['item'] list doesn't fall into the
        > list of good words, then an empty string gets returned.
        >
        > Any suggestions on how to handle this?
        >
        > Thanks,
        >
        > Jim Carlock

        Jim, not knowing your requirements or what the website will be used for makes it
        a little difficult to give you a solution. Would a drop-down list of acceptable
        words be better than expecting the user to type them correctly?

        That being said, if you type as badly as I do, you have probably made all of teh
        tpying errors most commonly seen. Including a str_replace() for all of those
        examples would not be that difficult. Better yet, put it in JavaScript
        and let the client side handle the word corrections (onclick or onsubmit).

        I have worked with several products (OS and database) that will auto-correct
        some commands, like eixt = EXIT or comit = COMMIT. The Digital TOPS-10/20 OS
        that ran on the KL10/20 systems (36-bit, circa mid-'70s to early '80s) would
        prompt you for a yes/no to:
        "Did you mean [whatever the correct spelling of the command is]?" Pretty cool
        for its day...

        --
        Michael Austin.
        DBA Consultant
        Donations welcomed. Http://www.firstdbasource.com/donations.html
        :)
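
        A rough sketch of a "did you mean" lookup, using PHP's built-in
        levenshtein() edit-distance function. The helper name suggestWord(),
        the $commands list and the distance cutoff of 2 are assumptions chosen
        for illustration, not anything taken from TOPS-10/20.

        ```php
        <?php
        // Hypothetical helper: closest known word within $maxDistance edits, or null.
        function suggestWord(string $input, array $knownWords, int $maxDistance = 2): ?string
        {
            $best = null;
            $bestDistance = $maxDistance + 1;
            foreach ($knownWords as $word) {
                // levenshtein() counts single-character inserts, deletes and replacements.
                $distance = levenshtein(strtolower($input), strtolower($word));
                if ($distance < $bestDistance) {
                    $best = $word;
                    $bestDistance = $distance;
                }
            }
            return $best;
        }

        $commands = ['EXIT', 'COMMIT', 'ROLLBACK']; // placeholder command list

        echo suggestWord('eixt', $commands), "\n";  // EXIT
        echo suggestWord('comit', $commands), "\n"; // COMMIT
        var_dump(suggestWord('xyzzy', $commands));  // NULL
        ```

        A swapped pair of letters counts as two edits here, which is why the
        cutoff is 2 rather than 1.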


        • Jim Carlock

          #5
          Re: Removing Bad Words

          Jim Carlock wrote:
          > So the next thoughts involved employing a list of good words
          > and if any word in the $_GET['item'] list doesn't fall into the
          > list of good words, then an empty string gets returned.
          >
          > Any suggestions on how to handle this?

          "Michael Austin" replied:
          > Jim, not knowing your requirements or what the website will be
          > used for makes it a little difficult to give you a solution. Would
          > a drop-down list of acceptable words be better than expecting
          > the user to type them correctly?

          Well, a drop-down list will work for some things, but
          anyone can edit the line of text in the address bar. So instead
          of filtering for bad words, I'm looking for suggestions on how to
          parse through a list of good words (stored inside an array); if
          any of the words in the address bar fail to match any of the
          words in the array, the individual gets routed to a
          bad-word page (the website homepage). I see a database as a
          very useful option but I'm working with PHP arrays at the
          moment. The database will be the future, but for the moment, I
          think an array of 200 possible words might work very well.

          Just need an effective way to compare a word to a list of words
          inside an array and return true if it matches, false if it fails the
          match.

          My thoughts include:

          function IsValidWord($sCheckThis) {
              global $aWords;
              foreach ($aWords as $sWord) {
                  if ($sWord === $sCheckThis) {
                      return(TRUE);
                  }
              }
              return(FALSE);
          }

          So I'm looking for any other suggestions.

          > That being said, if you type as badly as I do, you have probably
          > made all of teh tpying errors most commonly seen. Including a
          > str_replace() for all of those examples would not be that difficult
          > - better yet include it into a javascript and let the client-side
          > handle the word-corrections (onclick or onsubmit).

          The list of words is to remain on the server, so JavaScript in this
          case, seems to be an invalid option. Any mistyped words are to
          route the client to the homepage, or perhaps present the page in
          question with no selections selected. Either/or seems appropriate
          in this case.

          <snip>...</snip>

          Jim Carlock
          Post replies to the group.
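
          A sketch of how the whole check-and-redirect flow might fit
          together, assuming the words in $_GET['item'] are separated by
          spaces and the whitelist is stored in lowercase. The helper
          allWordsValid() and the three sample words are placeholders.

          ```php
          <?php
          // Hypothetical helper: true only if every space-separated word is whitelisted.
          function allWordsValid(string $item, array $goodWords): bool
          {
              // PREG_SPLIT_NO_EMPTY ignores repeated or trailing spaces.
              $words = preg_split('/\s+/', trim($item), -1, PREG_SPLIT_NO_EMPTY);
              if ($words === []) {
                  return false; // an empty parameter is treated as invalid
              }
              foreach ($words as $word) {
                  // The third argument makes in_array() compare strictly (===).
                  if (!in_array(strtolower($word), $goodWords, true)) {
                      return false;
                  }
              }
              return true;
          }

          $aWords = ['good', 'better', 'best']; // placeholder whitelist, lowercase

          $item = isset($_GET['item']) && is_string($_GET['item']) ? $_GET['item'] : '';

          // The PHP_SAPI test only skips the redirect when run from the command line.
          if (!allWordsValid($item, $aWords) && PHP_SAPI !== 'cli') {
              header('Location: /'); // send the visitor to the homepage
              exit;
          }
          ```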



          • Chung Leong

            #6
            Re: Removing Bad Words

            The function you need is in_array() although an associative array would
            be more efficient. E.g.

            $good_hash = array(
                'good' => true,
                'better' => true,
                'best' => true,
                ...
            );

            if (!array_key_exists(strtolower($word), $good_hash)) {
                ...
            }
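
            To make the comparison concrete, here is a small sketch that
            runs both lookups side by side. The three-word list is a
            placeholder, and building the hash with array_flip() is just one
            convenient option rather than part of the suggestion above.

            ```php
            <?php
            $good_list = ['good', 'better', 'best'];

            // array_flip() turns the values into keys:
            // ['good' => 0, 'better' => 1, 'best' => 2]
            $good_hash = array_flip($good_list);

            $word = 'Better';

            // Linear scan: compares against each element in turn.
            var_dump(in_array(strtolower($word), $good_list, true));   // bool(true)

            // Keyed lookup: one hash probe regardless of list size.
            var_dump(array_key_exists(strtolower($word), $good_hash)); // bool(true)
            var_dump(isset($good_hash['worst']));                      // bool(false)
            ```

            For a list of a couple of hundred words the difference is
            unlikely to be noticeable; the keyed form matters more as the
            list grows.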


            • Jim Carlock

              #7
              Re: Array Storage: Lowercase Versus Mixed-case [Topic was: Removing Bad Words]

              On 23 Feb 2006 00:29:48 GMT,
              "Chung Leong" <chernyshevsky@hotmail.com> posted:
              > The function you need is in_array() although an associative array
              > would be more efficient. E.g.

              $good_hash = array(
                  'good' => true,
                  'better' => true,
                  'best' => true,
                  ...
              );

              if (!array_key_exists(strtolower($word), $good_hash)) {
                  ...
              }

              Thanks, Chung. It seems like it's best to store everything inside the
              array as lowercase and then fill in some appropriate variables for display.

              I initially started out with mixed-case arrays. For example:

              // array of states
              function Create_USA_States_Array() {
                  $aStates = array(
                      // http://www.usps.com/ncsc/lookups/usp...eviations.html
                      array("Alabama", "AL"),
                      array("Alaska", "AK"),
                      array("Arizona", "AZ"),
                      array("Arkansas", "AR"),
                      array("California", "CA"),
                      array("Colorado", "CO"),
                      array("Connecticut", "CT"),
                      array("Delaware", "DE"),
                      array("Florida", "FL"),
                      array("Georgia", "GA"),
                      array("Hawaii", "HI"),
                      array("Idaho", "ID"),
                      array("Illinois", "IL"),
                      array("Indiana", "IN"),
                      array("Iowa", "IA"),
                      array("Kansas", "KS"),
                      array("Kentucky", "KY"),
                      array("Louisiana", "LA"),
                      array("Maine", "ME"),
                      array("Maryland", "MD"),
                      array("Massachusetts", "MA"),
                      array("Michigan", "MI"),
                      array("Minnesota", "MN"),
                      array("Mississippi", "MS"),
                      array("Missouri", "MO"),
                      array("Montana", "MT"),
                      array("Nebraska", "NE"),
                      array("Nevada", "NV"),
                      array("New Hampshire", "NH"),
                      array("New Jersey", "NJ"),
                      array("New Mexico", "NM"),
                      array("New York", "NY"),
                      array("North Carolina", "NC"),
                      array("North Dakota", "ND"),
                      array("Ohio", "OH"),
                      array("Oklahoma", "OK"),
                      array("Oregon", "OR"),
                      array("Pennsylvania", "PA"),
                      array("Rhode Island", "RI"),
                      array("South Carolina", "SC"),
                      array("South Dakota", "SD"),
                      array("Tennessee", "TN"),
                      array("Texas", "TX"),
                      array("Utah", "UT"),
                      array("Vermont", "VT"),
                      array("Virginia", "VA"),
                      array("Washington", "WA"),
                      array("Washington, D.C.", "DC"),
                      array("West Virginia", "WV"),
                      array("Wisconsin", "WI"),
                      array("Wyoming", "WY"));
                  return($aStates);
              }

              The function established to return a state name works as follows:

              // this function is incomplete
              // PURPOSE: RETURN statename from parameter passed in
              // INPUT: City-State String, OPTIONAL default string
              // RETURNS: empty string if invalid parameter requested
              // $sDS represents default state name to return
              // $sCS = $_GET['citystate'];
              // "Charlotte NC" or "Charlotte North Carolina" or "Charlotte" or
              // "usertyped garbage"
              function GetStateNameFromCityState($sCS, $sDS = "") {
                  $sStateAbbr = trim($sCS);
                  $iLen = strlen($sStateAbbr);
                  // first check to see if the string is empty or too short
                  if ($iLen < 2) { return($sDS); }
                  if (GetStateFromAbbr($sStateAbbr)) {
                      // a valid abbreviation was passed in
                      return(GetStateFromAbbr($sStateAbbr));
                  }
                  $aStates = Create_USA_States_Array();
                  // possible state name in parameter so check for a state name,
                  // before checking against abbreviations
                  foreach ($aStates as $aState) {
                      // state name: $aState[0]
                      if (stristr($sStateAbbr, $aState[0]) !== FALSE) {
                          // return state name
                          return($aState[0]);
                      }
                  }
                  // no valid statename found, so start abbreviation checks
                  // first determine if there's an abbreviation present
                  // explode(separator, string to separate)
                  $aWords = explode(" ", $sStateAbbr);
                  $yAbbrFound = FALSE;
                  // check for abbreviations
                  foreach ($aWords as $sWord) {
                      if (strlen($sWord) == 2) {
                          // assume a 2-letter word represents a state abbreviation
                          $sStateAbbr = $sWord;
                          $yAbbrFound = TRUE;
                          break;
                      }
                  }
                  if (!$yAbbrFound) {
                      // no abbreviation to check, so return empty string
                      return($sDS);
                  }
                  // now validate abbreviation found
                  // COULD this fail? NEEDS MORE TESTING.
                  foreach ($aStates as $aState) {
                      // now check against abbreviations
                      if (stristr($sStateAbbr, $aState[1]) !== FALSE) {
                          // return state name in proper formatting
                          // ($aState[0] is the name, $aState[1] the abbreviation)
                          return($aState[0]);
                      }
                  }
                  // return empty string when it all fails (default state)
                  return($sDS);
              }

              Haven't fully tested the user-typed garbage being passed in, but
              my question specifically involves configuring the state array, and
              alternative suggestions for this.

              Note, that the above function actually returns what's found inside
              the predefined array, rather than what's found in the address-bar.
              This in effect, should get me words proper for HTML presentation,
              where I don't have to mess with capitalizing ALL state abbrev's,
              or capitalizing the first word of anything.

              I still need to test the code above some more, so if anyone happens
              to catch a flaw please point it out.

              And again back to the question in the topic... "Lowercase Versus
              Mixed-case" words inside the array that holds the states and state
              abbreviations. Anyone here that knows of a better way to do this?
              Another array might get created, as the list of targeted cities is over
              100 right at the moment. To possibly identify each city to a proper
              state.

              I plan on getting something going whereby a new array appears as
              follows:

              "city name", iStateNumber

              "state number" represents an integer 0 to 50 (51 states).
              Duplicate "city name"s could exist, so the database combines
              the "city name" and the "state number" into an index. The "state
              number" ends up being a pointer to the StateID in the State
              database. So continuing along the lines of the indexed arrays,
              as presented by Chung Leong, how would I go about indexing
              such an array as above and would indexing be appropriate for
              such?

              Thanks, Chung Leong. I did put the indexed array into play in
              another function where the number of items is greater. I didn't
              know how to work it into this particular array (or an array with
              multiple fields with duplicate records).

              Jim Carlock
              Post replies to the group.
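
              One possible shape for the city array, sketched under the
              assumption that the state abbreviation can stand in for the
              "state number". The sample rows, the $cityIndex layout and the
              helper statesForCity() are illustrative only. Keying by
              lowercase city name and keeping a sub-array of states lets
              duplicate city names coexist.

              ```php
              <?php
              // Illustrative data only; the abbreviation acts as the state pointer.
              $cities = [
                  ['Charlotte', 'NC'],
                  ['Springfield', 'IL'],
                  ['Springfield', 'MO'],
              ];

              // Index keyed by lowercase city name; duplicates collect in a sub-array
              // keyed by abbreviation, with the display name as the value.
              $cityIndex = [];
              foreach ($cities as [$city, $abbr]) {
                  $cityIndex[strtolower($city)][$abbr] = $city;
              }

              // Hypothetical helper: list the states in which a city name is known.
              function statesForCity(string $city, array $cityIndex): array
              {
                  $key = strtolower(trim($city));
                  return isset($cityIndex[$key]) ? array_keys($cityIndex[$key]) : [];
              }

              print_r(statesForCity('springfield', $cityIndex)); // IL and MO
              print_r(statesForCity('Gotham', $cityIndex));      // empty array
              ```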



              • Chung Leong

                #8
                Re: Array Storage: Lowercase Versus Mixed-case [Topic was: Removing Bad Words]

                Jim Carlock wrote:
                > And again back to the question in the topic... "Lowercase Versus
                > Mixed-case" words inside the array that holds the states and state
                > abbreviations. Anyone here that knows of a better way to do this?
                > Another array might get created, as the list of targeted cities is over
                > 100 right at the moment. To possibly identify each city to a proper
                > state.

                Just have the static array be in mixed case, then generate the other
                one(s) programmatically:

                $states = array(
                    "AL" => "Alabama",
                    ...
                    "WY" => "Wyoming"
                );

                $state_hash = array_flip(array_map('strtolower', $states));
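
                A short sketch of how the two arrays might be used together,
                with a three-state sample in place of the full list. The
                helper stateDisplayName() is hypothetical; it accepts a name
                or an abbreviation in any case and returns the mixed-case
                name from $states.

                ```php
                <?php
                $states = [
                    'NC' => 'North Carolina',
                    'WV' => 'West Virginia',
                    'VA' => 'Virginia',
                ];

                // Lowercase name => abbreviation, generated from the mixed-case array.
                $state_hash = array_flip(array_map('strtolower', $states));

                // Hypothetical helper: returns the display name, or '' when nothing matches.
                function stateDisplayName(string $input, array $states, array $state_hash): string
                {
                    $key = strtolower(trim($input));
                    if (isset($state_hash[$key])) {
                        return $states[$state_hash[$key]];
                    }
                    $abbr = strtoupper($key);
                    return isset($states[$abbr]) ? $states[$abbr] : '';
                }

                echo stateDisplayName('west virginia', $states, $state_hash), "\n"; // West Virginia
                echo stateDisplayName('nc', $states, $state_hash), "\n";            // North Carolina
                ```

                Because the lookup is an exact key match, "west virginia"
                cannot be mistaken for "virginia" the way a substring search
                could.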
