PHP4 : Extract text from HTML file

  • trihanhcie@gmail.com

    PHP4 : Extract text from HTML file

    Hi,

    I would like to extract the text from an HTML file.
    For the moment, I'm trying to get all the text between <td> and </td>. I
    used a regular expression because I don't know the format of the content
    between <td> and </td>.

    It can be:
    <td>text1 </td>
    or
    <td>
    text1
    </td>
    or anything else.

    eregi("<td(.*)>(.*)(</td>?)",$text,$regtext);

    The problem is that, if I have
    <td>text</td>
    <td>text2</td>

    $regtext will return text</td><td>text2.

    How can I change the expression so that it stops at the first occurrence
    of </td>?

    Thanks

  • e.ahlback@gmail.com

    #2
    Re: PHP4 : Extract text from HTML file

    Hi.

    Not sure, but I think this is what you want.

    These functions should be able to extract the text from any tags!

    Sorry if I'm wrong.


    • e.ahlback@gmail.com

      #3
      Re: PHP4 : Extract text from HTML file

      Of course, I was wrong. Didn't notice that you were using PHP4.
      Take a look at http://fi.php.net/manual/en/ref.domxml.php instead.


      • trihanhcie@gmail.com

        #4
        Re: PHP4 : Extract text from HTML file

        It looks like these functions are meant for XML files. Can they still be
        used for HTML files?




        • e.ahlback@gmail.com

          #5
          Re: PHP4 : Extract text from HTML file


          trihanhcie@gmail.com wrote:
          > It looks like these functions are used for XML files, can it still be
          > used for html files?

          That should be what they're for... Try it!


          • Tim Martin

            #6
            Re: PHP4 : Extract text from HTML file

            trihanhcie@gmail.com wrote:
            > How can I change the expression so that it stops at the first occurrence
            > of </td>?

            If that's all you want to change, then you can just add the '?' (minimal
            match) qualifier to the '.*' within your regexp. By default, the '*'
            operator is "greedy" (that is, matches as much data as possible). If you
            replace that with '.*?' it will find the minimum amount of text that
            satisfies your requirements.
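
A minimal sketch of the greedy versus minimal match described above. Lazy quantifiers such as `.*?` are a Perl-compatible (PCRE) feature and are not defined by the POSIX extended syntax that `eregi` uses, so this sketch assumes the pattern is moved to `preg_match`. The sample string in `$html` is made up for illustration.

```php
<?php
// Hypothetical sample input: two cells in one string.
$html = '<td>text</td><td>text2</td>';

// Greedy: (.*) runs on to the LAST </td> in the string.
preg_match('/<td[^>]*>(.*)<\/td>/is', $html, $greedy);

// Lazy: (.*?) stops at the FIRST </td> that lets the match succeed.
preg_match('/<td[^>]*>(.*?)<\/td>/is', $html, $lazy);

echo $greedy[1], "\n"; // text</td><td>text2
echo $lazy[1], "\n";   // text
```

The only difference between the two patterns is the `?` after `.*`; everything else is kept identical so the effect of the quantifier is isolated.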

            If you want heavier-duty HTML parsing, you're probably better off looking
            for a library rather than trying to do it all by hand anyway, as the
            other poster suggested.

            Tim


            • trihanhcie@gmail.com

              #7
              Re: PHP4 : Extract text from HTML file

              Thanks for your advice :D Well the 'ungreedy' solution worked for the
              moment ;)
              I will try the library later :)



              • nerkn

                #8
                Re: PHP4 : Extract text from HTML file

                Unfortunately, the document must be a valid XML file. Given how most
                webmasters write their HTML, that is an idealistic case.
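
As a hedged aside for readers who are not tied to PHP 4: from PHP 5 onward the DOM extension provides `DOMDocument::loadHTML()`, which uses an HTML parser and accepts markup that is not well-formed XML. The sketch below assumes that extension is available; the sample markup (unquoted attribute, unclosed `<br>`) is invented for illustration and is not from the thread.

```php
<?php
// Hypothetical input that is fine as HTML but not well-formed XML.
$html = '<table><tr><td align=left>text<br></td><td>text2</td></tr></table>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html); // HTML parser, tolerant of sloppy markup
libxml_clear_errors();

// Gather the text content of every <td>, whatever markup it contains.
$cells = array();
foreach ($doc->getElementsByTagName('td') as $td) {
    $cells[] = trim($td->textContent);
}

print_r($cells);
```

This does not help on PHP 4 itself, where the older DOM XML extension linked earlier in the thread is the relevant one.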


                • Gertjan Klein

                  #9
                  Re: PHP4 : Extract text from HTML file

                  trihanhcie@gmail.com wrote:
                  >eregi("<td(.*)>(.*)(</td>?)",$text,$regtext);
                  >
                  >The problem is that, if I have
                  ><td>text</td>
                  ><td>text2</td>
                  >
                  >regtext will return text</td><td>text2.
                  >
                  >How can I change the expression so that it stops at the first occurrence
                  >of </td>?

                  The cause of the problem is that the regex is greedy (i.e., matches as
                  much as possible given the constraints of the expression). The simplest
                  solution, if you are sure that the table cell contents will have no
                  other markup, is to change the regex to "<td[^>]*>([^<]*)</td>". This
                  specifies that no open angle bracket can exist between the td and /td.
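
A small sketch of the negated character class approach, assuming cells that contain no nested tags. It uses `preg_match_all` rather than `eregi` (which was removed in PHP 7) so that it runs on current versions; the sample input and the `$cells` variable are made up for illustration.

```php
<?php
// Hypothetical sample input: one cell with an attribute, one spread over lines.
$html = "<td class=\"a\">text1 </td>\n<td>\ntext2\n</td>";

// [^<]* cannot cross a '<', so each match ends at its own </td>
// without needing a lazy quantifier.
preg_match_all('/<td[^>]*>([^<]*)<\/td>/i', $html, $matches);

// $matches[1] holds the captured cell contents; trim() drops the
// surrounding spaces and newlines.
$cells = array_map('trim', $matches[1]);

print_r($cells);
```

Because the pattern relies only on bracket expressions, it does not depend on greedy or ungreedy matching at all.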

                  If you can't be sure of that, I'd suggest something like this:

                  preg_match('/<td[^>]*>(.*)<\/td>/imsU', $text, $regtext);

                  The modifiers in this regex specify that it should be non-greedy (U),
                  case insensitive (i), and treat newlines as not special (s, so that '.'
                  also matches a newline). It only returns information about the first
                  <td></td>; if you want to get them all, preg_match_all will do the trick
                  with the same regex. (Tested on version 4.1.2.)
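
A sketch of the `preg_match_all` variant mentioned above, using the same regex. The sample input is invented, and the `strip_tags()` and `trim()` clean-up step is an addition for illustration, not part of the suggestion in this post.

```php
<?php
// Hypothetical sample input: cells with mixed case tags and nested markup.
$text = "<TD align=\"left\"><b>text</b></TD>\n<td>\ntext2 <i>more</i>\n</td>";

// i = case-insensitive, s = '.' also matches newlines,
// U = quantifiers are ungreedy by default, m = ^/$ per line (unused here).
preg_match_all('/<td[^>]*>(.*)<\/td>/imsU', $text, $regtext);

// $regtext[1] holds the inner HTML of each cell. strip_tags() removes
// the nested markup, trim() the surrounding whitespace.
$cells = array();
foreach ($regtext[1] as $inner) {
    $cells[] = trim(strip_tags($inner));
}

print_r($cells);
```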

                  HTH,
                  Gertjan.
                  --
                  Gertjan Klein <gklein@xs4all.nl>


                  • Tim Martin

                    #10
                    Re: PHP4 : Extract text from HTML file

                    trihanhcie@gmail.com wrote:
                    > Hi,
                    >
                    > I would like to extract the text in an HTML file
                    > For the moment, I'm trying to get all text between <td> and </td>. I
                    > used a regular expression because i don't know the format between
                    > <td> and </td>
                    [snip]

                    By the way, please don't waste people's time by multi-posting. If you
                    think this question is appropriate both here and in comp.programming [1],
                    then please cross-post it so that people in both groups can see the
                    responses and can avoid spending time answering a question that's
                    already been answered in the other group.

                    Tim

                    [1] In my opinion, it isn't
