Conversion of pdf codings to xml codings

  • Puneeth kamath
    New Member
    • Dec 2010
    • 10

    Conversion of pdf codings to xml codings

    Hi,

    I am new to ASP.NET with C#. Can anyone tell me the steps for creating a tool that automatically converts PDF coding into XML coding? Please help me and suggest the best approach.
  • Rabbit
    Recognized Expert MVP
    • Jan 2007
    • 12517

    #2
    The first step is to learn the PDF specification, which can be found on the Adobe website. Once you have learned that, we can work on the implementation.

    • Puneeth kamath
      New Member
      • Dec 2010
      • 10

      #3
      OK sir, thank you. Please keep the suggestions coming.

      • Puneeth kamath
        New Member
        • Dec 2010
        • 10

        #4
        Hi sir, I have studied the PDF specification. What is the next step? Please guide me toward my goal.

        • Rabbit
          Recognized Expert MVP
          • Jan 2007
          • 12517

          #5
          Once you know the PDF format, you can start writing your code. You'll want to create a filestream to read the PDF file and then a different filestream to write out the XML. After the file is created, you loop through the PDF file and use your new knowledge of the specification to interpret it into the format you want.

          This is just the overview. If you get stuck on a particular part, post your code and tell us what it should do and what it's doing wrong, and we will take a look at it.
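As a rough illustration of the two-stream idea, here is a minimal sketch. It only reads the `%PDF-x.y` header that every PDF file starts with and writes it into a tiny XML document; real content extraction would replace that part. The class name `PdfToXmlSketch`, its method names, and the `document`/`pdfVersion` XML names are hypothetical and chosen only for this example.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class PdfToXmlSketch
{
    // Reads the first 8 bytes (for example "%PDF-1.4") and returns the version part.
    public static string ReadPdfVersion(Stream pdf)
    {
        var header = new byte[8];
        int read = pdf.Read(header, 0, header.Length);
        string text = Encoding.ASCII.GetString(header, 0, read);
        if (!text.StartsWith("%PDF-", StringComparison.Ordinal))
            throw new InvalidDataException("Not a PDF file: missing %PDF- header.");
        return text.Substring(5);
    }

    // Writes a minimal XML document to the output stream (hypothetical element names).
    public static void WriteXml(Stream output, string version)
    {
        var settings = new XmlWriterSettings { Indent = true };
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartDocument();
            writer.WriteStartElement("document");
            writer.WriteAttributeString("pdfVersion", version);
            writer.WriteEndElement();
            writer.WriteEndDocument();
        }
    }

    // One file stream for reading the PDF, a different one for writing the XML.
    public static void Convert(string pdfPath, string xmlPath)
    {
        using (FileStream input = File.OpenRead(pdfPath))
        using (FileStream output = File.Create(xmlPath))
        {
            WriteXml(output, ReadPdfVersion(input));
        }
    }
}
```

Calling `PdfToXmlSketch.Convert(inputPath, outputPath)` with your own paths would produce the XML file. The interpretation loop described above would go between the read and the write.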

          • Puneeth kamath
            New Member
            • Dec 2010
            • 10

            #6
            I tried to read the contents of the PDF file, but it throws an invalid exception at the line "ContentScanner.TextWrapper text = (ContentScanner.TextWrapper)level.CurrentWrapper;" in the code below.


            Code:
            using System;
            using System.Collections.Generic;
            using System.Linq;
            using System.Text;
            using it.stefanochizzolini.clown.documents;
            using it.stefanochizzolini.clown.files;
            using it.stefanochizzolini.clown.documents.contents;
            using it.stefanochizzolini.clown.documents.contents.objects;
            using it.stefanochizzolini.clown.tools;
            using it.stefanochizzolini.clown.documents.contents.composition;
            using it.stefanochizzolini.clown.documents.contents.fonts;
            namespace ConsoleApplication1
            {
                class Program
                {
                    static void Main(string[] args)
                    {
                        string filePath = @"C:\Documents and Settings\XML\Desktop\Copyright.pdf";
            
                        File file;
                        Document document;
                        try
                        {
                            // Open the PDF file!
                            file = new File(filePath);
            
                            // Get the PDF document!
                            document = file.Document;
            
                        }
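                        catch (Exception ex)
                        {
                            // Report why the file could not be opened instead of hiding the error.
                            Console.WriteLine("Sorry, Some Errors in File: " + ex.Message);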
                            for (; ; )
                            {
                                if (Console.ReadLine() == "")
                                    break;
                            }
                            return;
                        }
            
                        //Page stamper is used to draw contents on existing pages.
                        PageStamper stamper = new PageStamper();
            
            
                        foreach (Page page in document.Pages)
                        {
                            Console.WriteLine("\nScanning page " + (page.Index + 1) + "...\n");
            
                            stamper.Page = page;
            
                            // Wraps the page contents into a scanner.
                            Extract(new ContentScanner(page), stamper.Foreground);
            
                            stamper.Flush();
                        }
            
                        for (; ; )
                        {
                            if (Console.ReadLine() == "")
                                break;
                        }
                    }
            
            
                    private static void Extract(ContentScanner level, PrimitiveFilter builder)
                    {
                        if (level == null)
                            return;
            
                        while (level.MoveNext())
                        {
                            ContentObject content = level.Current;
                            if (content is Text)
                            {
                                // Use 'as' so that a wrapper which is not a TextWrapper (or is null) is skipped instead of throwing.
                                ContentScanner.TextWrapper text = level.CurrentWrapper as ContentScanner.TextWrapper;
                                if (text == null)
                                    continue;
                                //ContentScanner.GraphicsState test = level.getState();
                                foreach (ContentScanner.TextStringWrapper textString in text.TextStrings)
                                {
                                    Console.WriteLine("Text [font size: " + textString.Style.FontSize + " ], [font Name: " +
                                        textString.Style.Font.Name + " ]: " + textString.Text);
                                }
                            }
                            else if (content is ShowText)
                            {
                                Font font = level.State.Font;
                                Console.WriteLine(font.Decode(((ShowText)content).Text));
                            }
                            else if (content is ContainerObject)
                            {
                                // Scan the inner level!
                                Extract(level.ChildLevel, builder);
                            }
            
                        }
            
                    }
            
                }
            }
            Last edited by Curtis Rutland; Jan 19 '11, 04:41 PM.

            • Rabbit
              Recognized Expert MVP
              • Jan 2007
              • 12517

              #7
              Please use code tags. What's the rest of the error text? It looks like you're using some custom library. It would be hard for someone to figure out what's going on with that library unless they've used it before.

              • Puneeth kamath
                New Member
                • Dec 2010
                • 10

                #8
                1. Extract the text, replace special characters with Unicode entities, and wrap the content with style information (use a stack data structure to store the font information for each chunk of text in a paragraph).
                2. List all styles used.
                3. List the tags using a DTD or Schema.
                4. Map styles to tags, or open a saved template of style-to-tag mappings.
                5. Validate the mapping using the DTD or Schema.
                6. Save the style-to-tag mapping as a template.
                7. Convert the content wrapped with style information into content wrapped with tags, according to the style-to-tag mapping.
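Steps 1 and 7 in the list above could look roughly like the sketch below. It assumes "Unicode entities" means numeric character references such as `&#xE9;`, and that a style is identified by a plain string key. The class `StyleTagMapper`, the `unmapped` fallback tag, and the key format are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class StyleTagMapper
{
    // Replaces '<', '>', '&' and every non-ASCII character with a
    // hexadecimal numeric character reference.
    public static string ToEntities(string text)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < text.Length; i++)
        {
            int codePoint;
            if (char.IsSurrogatePair(text, i))
            {
                // Characters outside the BMP occupy two UTF-16 code units.
                codePoint = char.ConvertToUtf32(text[i], text[i + 1]);
                i++;
            }
            else
            {
                codePoint = text[i];
            }

            if (codePoint > 127 || codePoint == '<' || codePoint == '>' || codePoint == '&')
                sb.Append("&#x").Append(codePoint.ToString("X")).Append(';');
            else
                sb.Append((char)codePoint);
        }
        return sb.ToString();
    }

    // Wraps a chunk of text in the tag mapped to its style.
    // Styles without a mapping fall back to a hypothetical "unmapped" tag.
    public static string Wrap(string text, string style, IDictionary<string, string> styleToTag)
    {
        string tag;
        if (!styleToTag.TryGetValue(style, out tag))
            tag = "unmapped";
        return "<" + tag + ">" + ToEntities(text) + "</" + tag + ">";
    }
}
```

A dictionary is used here because the saved template in steps 4 and 6 is essentially a set of style-to-tag pairs, so it can be loaded into and saved from the same structure.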

                These are the modules I wrote. I have done the 2nd module. Can you now help me with the 3rd module? How do I list the tags?
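One possible way to list the tags, assuming the tag set is defined in an XML Schema (XSD) rather than a DTD, is to load the schema with the .NET `XmlSchemaSet` class and read its global element declarations. This is only a sketch: the class name `SchemaTagLister` is hypothetical, and elements declared locally inside complex types are not included.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

static class SchemaTagLister
{
    // Returns the names of all globally declared elements in the schema, sorted.
    public static List<string> ListGlobalElements(TextReader xsd)
    {
        var schemas = new XmlSchemaSet();
        using (XmlReader reader = XmlReader.Create(xsd))
        {
            // A null namespace means "use the targetNamespace declared in the schema".
            schemas.Add(null, reader);
        }
        schemas.Compile();

        var names = new List<string>();
        foreach (XmlSchemaElement element in schemas.GlobalElements.Values)
        {
            names.Add(element.QualifiedName.Name);
        }
        names.Sort(StringComparer.Ordinal);
        return names;
    }
}
```

To read a schema from disk, pass something like `new StreamReader(path)` with your own path. The returned names can then be offered as the tag choices for the mapping step.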
