.doc to html and pdf conversion with python

  • Alexander Klingenstein

    .doc to html and pdf conversion with python

    I need to take a bunch of .doc files (Word 2000), which have a little text including some tables/layout and mostly pictures, convert them to PDF, and extract the text and images separately too. If I have a PDF, I can create the HTML with pdftohtml called from Python with popen. However, I need an automated way to convert the .doc to PDF first.

    Is there a way to do what I want, either with a Python lib, a 3rd-party app, or maybe by remote controlling Word (a la VBA) and "printing" to PDF with a distiller?
    I already tried wvWare from GnuWin32, but it has problems with big image files embedded in the .doc file (looks like an mmap error).
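
The pdftohtml step mentioned above could look roughly like this with the subprocess module (the usual replacement for os.popen). This is only a sketch: it assumes pdftohtml is on the PATH, and the helper names build_pdftohtml_cmd and pdf_to_html are made up for illustration.

```python
import subprocess


def build_pdftohtml_cmd(pdf_path, html_base):
    """Return the argument list for: pdftohtml <PDF-file> <html-file>."""
    return ["pdftohtml", pdf_path, html_base]


def pdf_to_html(pdf_path, html_base):
    """Run pdftohtml and raise CalledProcessError if it exits non-zero."""
    # A list of arguments (no shell=True) avoids quoting problems
    # with spaces in Windows paths.
    subprocess.run(build_pdftohtml_cmd(pdf_path, html_base), check=True)
```

Keeping the command construction separate from the call makes it easy to log or inspect the exact command before running it over a whole folder.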

    Alex


  • Luap777@gmail.com

    #2
    Re: .doc to html and pdf conversion with python

    Alexander Klingenstein wrote:
    I need to take a bunch of .doc files (Word 2000), which have a little text including some tables/layout and mostly pictures, convert them to PDF, and extract the text and images separately too. If I have a PDF, I can create the HTML with pdftohtml called from Python with popen. However, I need an automated way to convert the .doc to PDF first.

    Is there some reason you really want to convert to PDF first? You can
    get much better HTML right from the Word doc. You'll lose a lot of info
    going from PDF to HTML.

    Something like this can open doc in Word, save as HTML, then close doc.

    import os, win32com.client

    # win32com.client.constants is only populated once the Word type library
    # has been generated (makepy); gencache.EnsureDispatch takes care of that.
    wdApp = win32com.client.gencache.EnsureDispatch("Word.Application")
    wdApp.Visible = 1

    def SaveDocAsHTML(docPath, htmlPath):
        # Word does not share Python's working directory, so pass absolute paths.
        doc = wdApp.Documents.Open(os.path.abspath(docPath))
        # See
        # mk:@MSITStore:C:\Program%20Files\Microsoft%20Office\OFFICE11\1033\VBAWD10.CHM::/html/womthSaveAs1.htm
        # in Word VBA help doc for more info.

        # Saves all text and formatting with HTML tags so that the
        # resulting document can be viewed in a Web browser.
        doc.SaveAs(os.path.abspath(htmlPath), win32com.client.constants.wdFormatHTML)
        # Saves text with HTML tags with minimal cascading style sheet
        # formatting. The resulting document can be viewed in a Web browser.
        #doc.SaveAs(os.path.abspath(htmlPath),
        #           win32com.client.constants.wdFormatFilteredHTML)
        doc.Close()
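
For a whole folder of .doc files, the same idea might be wrapped in a loop like the sketch below. The names plan_conversions and convert_folder are hypothetical, and the numeric values are the WdSaveFormat codes from the Word object model, used directly so that no generated type library is needed. The PDF code is included because the original question asks for PDF, but it is only accepted by Word versions that can export PDF (2007 and later), not by the Office 11 install referenced above.

```python
import os

# WdSaveFormat values from the Word object model.
WD_FORMAT_HTML = 8
WD_FORMAT_FILTERED_HTML = 10
WD_FORMAT_PDF = 17  # assumption: a Word version with PDF export (2007 or later)


def plan_conversions(folder, new_ext):
    """Return (source, target) absolute path pairs for every .doc in folder."""
    folder = os.path.abspath(folder)
    pairs = []
    for name in sorted(os.listdir(folder)):
        base, ext = os.path.splitext(name)
        if ext.lower() == ".doc":
            pairs.append((os.path.join(folder, name),
                          os.path.join(folder, base + new_ext)))
    return pairs


def convert_folder(folder, new_ext=".html", file_format=WD_FORMAT_HTML):
    """Open each .doc in Word and save it next to the source in file_format."""
    # Imported here so the planning helper also works where pywin32 is missing.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    try:
        for source, target in plan_conversions(folder, new_ext):
            doc = word.Documents.Open(source)
            try:
                doc.SaveAs(target, file_format)
            finally:
                doc.Close()
    finally:
        # Quit Word even if one document fails, so no hidden instance lingers.
        word.Quit()
```

On a suitable Word version, something like convert_folder(r"C:\docs", ".pdf", WD_FORMAT_PDF) would be the PDF variant; that call is untested against Word 2000 or 2003.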

    And if you aren't satisfied with the ugly HTML you're likely to get,
    you can try running µTidylib (http://utidylib.berlios.de/) on the
    output after this step also.
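
Once the HTML exists, the separate text and image extraction asked for in the original question can be sketched with the standard library alone. This assumes Python 3's html.parser; the class name TextAndImages and the function extract are illustrative, and real Word output is messy enough that this should be treated as a starting point rather than a finished extractor.

```python
from html.parser import HTMLParser


class TextAndImages(HTMLParser):
    """Collect visible text chunks and <img> src values from an HTML document."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.images = []
        self._skip = 0  # depth inside <script>/<style>, whose content is not text

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip and data.strip():
            self.text.append(data.strip())


def extract(html):
    """Return (list of text chunks, list of image src values)."""
    parser = TextAndImages()
    parser.feed(html)
    parser.close()
    return parser.text, parser.images
```

The src values are relative to the HTML file, since Word typically writes the pictures into a companion folder next to it, so the image files themselves can be copied from there.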

    Thank you,
    Paul


    • Eric_Dexter@msn.com

      #3
      Re: .doc to html and pdf conversion with python

      Google won't do a good job with .doc files, but it may do PDF to HTML
      and back. It works one file at a time; I just mentioned it to make fun
      of them. Here is my resume converted from a monster.com .doc file:

      Word processing, presentations and spreadsheets on the web



