How to remove HTML tags from a string

  • Ra71sh
    New Member
    • Feb 2009
    • 3

    How to remove HTML tags from a string

    Hi,

    I store comments as text in a database, but special characters and formatting are stored as HTML tags.
    When I fetch the comments into a text file, I need only the comment text, with no HTML tags.
    Is there any way to remove the tags using a C program?

    Please help.

    -Ratish
  • donbock
    Recognized Expert Top Contributor
    • Mar 2008
    • 2427

    #2
    Are you confident the HTML file is well-formed? If so, you could delete all text enclosed in angle brackets. Note that horrible things will happen if the input file isn't well-formed.

    What are you supposed to do with character entity references (such as "&lt;") and numeric character references (such as "&#931;") -- pass them through or expand them?

    Is this an assignment that you have to do in a particular way, or are you happy with any approach that works? Try opening the file with a browser and then saving it as a text file.

    • Ra71sh
      New Member
      • Feb 2009
      • 3

      #3
      Actually, the text is stored in the database with the formatting applied on the application screen.
      I need to provide a report containing the comments that were entered, and for that I need to remove the HTML tags. I am using a C program to fetch the data from the database,
      but the data carries the HTML tags. I need to remove all the HTML tags that are in these comments.

      • JosAH
        Recognized Expert MVP
        • Mar 2007
        • 11453

        #4
        Yes, you already wrote that: copy all characters on a line until you scan a '<'; stop copying but keep on scanning until you see a '>'. Repeat until you've reached the end of the line.

        As already mentioned, character combinations such as '&lt;' (and uglier ones) pass through unharmed.
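
The copy-and-skip scan described above can be sketched in C roughly as follows. This is only a minimal sketch, assuming well-formed input in which a literal '<' or '>' never appears outside a tag; the function name `strip_tags` and the caller-supplied output buffer are assumptions made for illustration, not something from the original program.

```c
#include <assert.h>
#include <stddef.h>

/* Copy src to dst, dropping everything from a '<' through the next '>'.
 * dst must have room for at least strlen(src) + 1 bytes.
 * Entities such as "&lt;" are copied through unchanged. */
void strip_tags(const char *src, char *dst)
{
    int in_tag = 0;                 /* nonzero while between '<' and '>' */

    assert(src != NULL && dst != NULL);

    for (; *src != '\0'; src++) {
        if (in_tag) {
            if (*src == '>')
                in_tag = 0;         /* tag ends; resume copying */
        } else if (*src == '<') {
            in_tag = 1;             /* tag starts; stop copying */
        } else {
            *dst++ = *src;          /* ordinary text */
        }
    }
    *dst = '\0';
}
```

For example, the input `Hello <b>world</b>, 1 &lt; 2<br/>` would come out as `Hello world, 1 &lt; 2`. An unterminated '<' silently swallows the rest of the string, which is one of the "horrible things" that malformed input can cause.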

        kind regards,

        Jos

        • Ra71sh
          New Member
          • Feb 2009
          • 3

          #5
          That is fine, but my concern is for the cases where someone has entered some text, e.g. points where they have used <a>, <b>, or a>, b>, etc.
          How would I handle this?
          Also, in cases where I have special sequences like &amp; or #3688, what should I do?

          • JosAH
            Recognized Expert MVP
            • Mar 2007
            • 11453

            #6
            You have to write a complete, fault-tolerant HTML parser then. As a corollary, I understand that your database contains incorrect HTML data? If so, the GIGO principle (Garbage In, Garbage Out) rears its ugly head.
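
Short of a full parser, a heuristic can at least let stray '>' and lone '<' characters through. The sketch below is not a fault-tolerant HTML parser: it assumes a '<' starts a tag only when it is followed by a letter, '/' or '!' and a '>' appears somewhere later. The names `looks_like_tag` and `strip_tags_lenient` are hypothetical. Text that a user literally typed as `<a>` or `<b>` without escaping still cannot be told apart from a real tag, so GIGO still applies.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Heuristic: does the '<' at p look like the start of a real tag? */
int looks_like_tag(const char *p)
{
    unsigned char next = (unsigned char)p[1];

    if (!(isalpha(next) || next == '/' || next == '!'))
        return 0;                        /* e.g. "x < y" or "3 <4" */
    return strchr(p + 1, '>') != NULL;   /* a tag needs a closing '>' */
}

/* Like a plain tag stripper, but copies '<' and '>' that do not look
 * like part of a tag. dst needs at least strlen(src) + 1 bytes. */
void strip_tags_lenient(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);

    while (*src != '\0') {
        if (*src == '<' && looks_like_tag(src)) {
            src = strchr(src, '>') + 1;  /* skip the whole tag */
        } else {
            *dst++ = *src++;             /* text, stray '<' or stray '>' */
        }
    }
    *dst = '\0';
}
```

With this sketch, `a> b> x < y <b>bold</b> 3 <4` would become `a> b> x < y bold 3 <4`. It can still guess wrong: in `x <y and z > w` everything from `<y` to `>` is treated as a tag and dropped.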

            kind regards,

            Jos

            • donbock
              Recognized Expert Top Contributor
              • Mar 2008
              • 2427

              #7
              Originally posted by Ra71sh
              That is fine, but my concern is for the cases where someone has entered some text, e.g. points where they have used <a>, <b>, or a>, b>, etc.
              How would I handle this?
              Also, in cases where I have special sequences like &amp; or #3688, what should I do?
              In a well-formed HTML file, the input characters "<" and ">" would be replaced by "&lt;" and "&gt;". Is that happening for you? If not, then your file is malformed and you will have great difficulty parsing it.

              Regarding "&#3688;" and its kin, don't ask us ... what do you think should happen? Are you emitting a text file? If so, then you are limited to printable characters. What is a meaningful and useful way to handle nonprintable characters in your context?

              Do you want to see the html page exactly as it would appear on a browser page -- with all the mark-ups? If so, then open the file with a browser and save or print to a pdf file.
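
If the answer to the entity question above turns out to be "expand them", one possible approach is sketched below. It rests on several assumptions that are not from the thread: the report is a UTF-8 text file, only four named entities (`&lt;`, `&gt;`, `&amp;`, `&quot;`) and decimal references such as `&#931;` need handling, and anything unrecognized is passed through unchanged. The names `utf8_encode` and `decode_entities` are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Write code point cp as UTF-8 into out; return the number of bytes
 * written, or 0 if cp is not a valid, nonzero code point. */
size_t utf8_encode(unsigned long cp, char *out)
{
    if (cp == 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Expand a few entities from src into dst. The output is never longer
 * than the input, so dst needs at least strlen(src) + 1 bytes. */
void decode_entities(const char *src, char *dst)
{
    static const struct { const char *name; char ch; } named[] = {
        { "&lt;", '<' }, { "&gt;", '>' }, { "&amp;", '&' }, { "&quot;", '"' }
    };

    assert(src != NULL && dst != NULL);

    while (*src != '\0') {
        size_t i, n = 0;                 /* n = input bytes consumed */

        if (*src == '&') {
            for (i = 0; i < sizeof named / sizeof named[0]; i++) {
                size_t len = strlen(named[i].name);
                if (strncmp(src, named[i].name, len) == 0) {
                    *dst++ = named[i].ch;
                    n = len;
                    break;
                }
            }
            /* Decimal reference: "&#" digits ";" */
            if (n == 0 && src[1] == '#' && isdigit((unsigned char)src[2])) {
                char *end;
                unsigned long cp = strtoul(src + 2, &end, 10);
                size_t w = (*end == ';') ? utf8_encode(cp, dst) : 0;
                if (w > 0) {
                    dst += w;
                    n = (size_t)(end - src) + 1;
                }
            }
        }
        if (n == 0) {                    /* not an entity we know: copy */
            *dst++ = *src;
            n = 1;
        }
        src += n;
    }
    *dst = '\0';
}
```

If tags are also being removed, decoding would normally run after the tag-stripping step, so that text such as `&lt;b&gt;` ends up as a literal `<b>` in the report and is not mistaken for a tag.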

              • donbock
                Recognized Expert Top Contributor
                • Mar 2008
                • 2427

                #8
                A malformed HTML file can confuse the HTML parser:
                html parse error

                • JosAH
                  Recognized Expert MVP
                  • Mar 2007
                  • 11453

                  #9
                  That is so funny. ;-)

                  kind regards,

                  Jos
