Treating text copied from MS Word

  • +mrcakey

    Treating text copied from MS Word

    I've built a MySQL database for a client and a web interface to be able to
    add/edit/delete records in it. When he's adding stuff to the database he's
    copying text from MS Word. I've tried various substitutions that I've found
    hanging around the internet, but nothing's working for the "long dash" that
    it insists on converting normal hyphens to.

    This morning I did a bin2hex to see exactly what was being sent from $_POST:

    A - long dash -.

    41 20 >>>e2 80 93<<< 20 6c 6f 6e 67 20 64 61 73 68 20 2d 2e 20 20

    The offending character is the one I've highlighted. As far as I can tell,
    it should be getting found by this -

    "\\xe2\\x80\\x93", // long dash

    but it isn't, which makes me think there's something wrong with the code
    I've copied. How should I write the hex string so that it matches? I've
    also tried "\xe2\x80\x93" and "\xe2x80x93", but to no avail.

    Is driving me scatty!!!

    Any help much appreciated.

    $search = array(
        chr(145),
        chr(146),
        chr(147),
        chr(148),
        chr(151),
        chr(196),
        'â?o', // left side double smart quote
        '�', // right side double smart quote
        'â?~', // left side single smart quote
        'â?T', // right side single smart quote
        'â?¦', // ellipsis
        'â?"', // em dash
        'â?"', // en dash
        "\\xe2\\x80\\xa6", // ellipsis
        "\\xe2\\x80\\x93", // long dash
        "\\xe2\\x80\\x94", // long dash
        "\\xe2\\x80\\x9c", // double quote opening
        "\\xe2\\x80\\x9d", // double quote closing
        "\\xe2\\x80\\xa2" // dot used for bullet points
    );
    $replace = array( "'",
    "'",
    '"',
    '"',
    '-',
    '-',
    '"',
    '"',
    "'",
    "'",
    "&hellip;",
    "-",
    "-",
    '&hellip;',
    '-',
    '-',
    '"',
    '"',
    '*'
    );
    echo '<p>'.bin2hex( $_POST['short_desc'] ).'</p>';
    $short_desc = str_replace($search, $replace, $_POST['short_desc']);

    +mrcakey


  • C. (http://symcbean.blogspot.com/)

    #2
    Re: Treating text copied from MS Word

    On Jul 9, 12:03 pm, "+mrcakey" <webmas...@listyblue.com> wrote:
    > I've built a MySQL database for a client and a web interface to be able to
    > add/edit/delete records in it. When he's adding stuff to the database he's
    > copying text from MS Word. I've tried various substitutions that I've found
    > hanging around the internet, but nothing's working for the "long dash" that
    > it insists on converting normal hyphens to.
    >
    > This morning I did a bin2hex to see exactly what was being sent from $_POST:
    >
    > A - long dash -.
    >
    > 41 20 >>>e2 80 93<<< 20 6c 6f 6e 67 20 64 61 73 68 20 2d 2e 20 20
    >
    > The offending character is the one I've highlighted. As far as I can tell,
    > it should be getting found by this -
    >
    > "\\xe2\\x80\\x93", // long dash
    >
    > but it isn't, which makes me think there's something wrong with the code
    > I've copied. How to find the hex string? I've tried "\xe2\x80\x93" and
    > "\xe2x80x93" in addition, but to no avail.
    >
    > <snip>

    Not really a PHP question - configure your webserver to use a 7 bit
    charset.

    C.


    • I V

      #3
      Re: Treating text copied from MS Word

      On Wed, 09 Jul 2008 12:03:57 +0100, +mrcakey wrote:
      > The offending character is the one I've highlighted. As far as I can
      > tell, it should be getting found by this -
      >
      > "\\xe2\\x80\\x93", // long dash

      You want to use one backslash here, not two. But rather than specifying
      the search-and-replace yourself, it's probably easier to use
      htmlentities. You need to know what encoding your data has been sent in
      (from your post, it looks like you're receiving UTF-8) and pass that
      encoding as the third argument, like so:

      $short_desc = htmlentities($_POST['short_desc'], ENT_COMPAT, 'UTF-8');
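
A minimal sketch of the two suggestions above, assuming the input really is UTF-8 and contains the bytes shown in the bin2hex dump (41 20 e2 80 93 ...). The variable names $input, $wrong and $right are made up for illustration. It shows why the double-backslash pattern never matches, what the single-backslash version contains, and what htmlentities produces for the same string. Note that the \x escapes are only interpreted inside double quotes; in a single-quoted PHP string they stay as literal text.

```php
<?php
// Sample input rebuilt from the hex dump in the question:
// "A", space, EN DASH (e2 80 93), " long dash -."
$input = "A \xe2\x80\x93 long dash -.";

// Two backslashes: PHP stores a literal backslash followed by "xe2" etc.,
// so this is 12 ordinary characters, not the 3-byte UTF-8 sequence.
$wrong = "\\xe2\\x80\\x93";

// One backslash, double quotes: each \xHH escape becomes a single byte.
$right = "\xe2\x80\x93";

echo strlen($wrong), "\n";    // 12
echo bin2hex($right), "\n";   // e28093

// Only the 3-byte version is found in the input.
echo str_replace($wrong, '-', $input) === $input ? "no match\n" : "match\n"; // no match
$fixed = str_replace($right, '-', $input);
echo $fixed, "\n";            // A - long dash -.

// Alternative from the reply: let htmlentities convert the character,
// telling it explicitly that the data is UTF-8.
$entities = htmlentities($input, ENT_COMPAT, 'UTF-8');
echo $entities, "\n";         // A &ndash; long dash -.
```

Running bin2hex on the search string itself, as well as on the input, is a quick way to check that the two actually contain the same bytes.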


      • C. (http://symcbean.blogspot.com/)

        #4
        Re: Treating text copied from MS Word

        On Jul 10, 5:07 pm, "C. (http://symcbean.blogspot.com/)"
        <colin.mckin...@gmail.com> wrote:
        > On Jul 9, 12:03 pm, "+mrcakey" <webmas...@listyblue.com> wrote:
        > > <snip>
        >
        > Not really a PHP question - configure your webserver to use a 7 bit
        > charset.
        >
        > C.

        Sorry - bum steer. Apparently MSIE is (once again) completely broken
        in this regard. There is a hack though - see


        C.
