parsing MS word docs -- tutorial request

  • bp.tralfamadore@gmail.com

    parsing MS word docs -- tutorial request

    All,

    I am trying to write a script that will parse and extract data from an
    MS Word document (perhaps from tables). Can / would anyone refer me to
    a tutorial on how to do that? I am aware of, and have downloaded,
    the pywin32 extensions, but am unsure of how to proceed -- I'm not
    familiar with the COM API for Word, so help with that would also be
    welcome.

    Any help would be appreciated. Thanks for your attention and
    patience.

    ::bp::
  • Okko Willeboordsed

    #2
    Re: parsing MS word docs -- tutorial request

    Get a copy of Python Programming on Win32, ISBN 1-56592-621-8.
    Use Google and the VBA documentation for help.



    • Mike Driscoll

      #3
      Re: parsing MS word docs -- tutorial request

      On Oct 29, 4:32 am, Okko Willeboordsed <okko.willeboor ...@gmail.com>
      wrote:
      > Get a copy of Python Programming on Win32, ISBN 1-56592-621-8
      > Use Google and VBA for help

      Also check out MSDN: the win32 module is a thin wrapper, so most of
      the syntax on MSDN or in VB examples can be translated directly to
      Python. There's also a PyWin32 mailing list, which is quite helpful.



      Mike
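
To illustrate how directly VBA-style calls carry over, here is a minimal sketch of reading every table in a document through COM. It assumes Windows with Word and pywin32 installed; `read_tables` and `clean_cell_text` are hypothetical helper names, not part of pywin32, and the loop assumes tables without merged cells (Word raises an error when `Cell(r, c)` points at a cell that does not exist).

```python
def clean_cell_text(raw):
    """Strip the end-of-cell marker (CR + BEL) that Word appends to cell text."""
    return raw.rstrip("\r\x07")


def read_tables(path):
    """Return every table in the document as a list of rows of cell strings.

    `path` should be an absolute path. Assumes uniform tables (no merged cells).
    """
    # Imported lazily because it needs Windows, Word and pywin32.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Positional arguments: FileName, ConfirmConversions, ReadOnly
        doc = word.Documents.Open(path, False, True)
        try:
            tables = []
            for table in doc.Tables:
                rows = []
                # COM collections in Word are 1-based, as in VBA.
                for r in range(1, table.Rows.Count + 1):
                    row = []
                    for c in range(1, table.Columns.Count + 1):
                        row.append(clean_cell_text(table.Cell(r, c).Range.Text))
                    rows.append(row)
                tables.append(rows)
            return tables
        finally:
            doc.Close(False)  # close without saving changes
    finally:
        word.Quit()
```

The member names (`Documents.Open`, `Tables`, `Cell`, `Range.Text`) are the same ones a VBA macro would use, which is why VB examples are a practical source of sample code.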


      • Reedick, Andrew

        #4
        RE: parsing MS word docs -- tutorial request

        -----Original Message-----
        From: python-list-bounces+jr9445=att.com@python.org
        [mailto:python-list-bounces+jr9445=att.com@python.org] On Behalf Of
        bp.tralfamadore@gmail.com
        Sent: Tuesday, October 28, 2008 10:26 AM
        To: python-list@python.org
        Subject: parsing MS word docs -- tutorial request


        Look up the Word Object Model reference in Microsoft's documentation.


        Google for sample code to get you started.



        • Kay Schluehr

          #5
          Re: parsing MS word docs -- tutorial request

          One can convert MS-Word documents into some class of XML documents
          called MHTML. If I remember correctly those documents had an .mht
          extension. The result is a huge amount of (nevertheless structured)
          markup gibberish together with text. If one spends time and attention
          one can find patterns in the markup (we have XML and it's human
          readable).

          A few years ago I used this conversion to implement roughly the
          following algorithm:

          1. I manually highlighted one or more sections in a Word doc using a
          background colour marker.
          2. I searched for the colour marked section and determined the
          structure. The structure information was fed into a state machine.
          3. With this state machine I searched for all sections that were
          equally structured.
          4. I applied a href link to the text that was surrounded by the
          structure and removed the colour marker.
          5. In another document I searched for the same text and set an anchor.

          This way I could link two documents (public specifications that
          were originally unconnected).

          Kay
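
A rough sketch of the first two steps (finding the colour-marked text) might look like the following. It rests on assumptions not stated in the post: that the .mht file is a MIME multipart archive wrapping an HTML part, so the standard `email` module can unpack it, and that Word writes a yellow highlight as a `<span>` whose style contains `background:yellow`. The names `HighlightFinder`, `html_from_mhtml`, `highlighted_sections` and `SAMPLE` are made up for illustration.

```python
import email
from html.parser import HTMLParser


class HighlightFinder(HTMLParser):
    """Collect the text inside <span> elements styled with a yellow background."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one bool per open <span>: is it highlighted?
        self.depth = 0    # number of highlighted spans currently open
        self.buf = []
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        highlighted = "background:yellow" in style  # assumed Word markup
        self.stack.append(highlighted)
        if highlighted:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag != "span" or not self.stack:
            return
        if self.stack.pop():
            self.depth -= 1
            if self.depth == 0:
                self.found.append("".join(self.buf).strip())
                self.buf = []

    def handle_data(self, data):
        if self.depth:
            self.buf.append(data)


def html_from_mhtml(text):
    """Return the first text/html part of an MHTML archive, decoded to str."""
    msg = email.message_from_string(text)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            payload = part.get_payload(decode=True)  # undoes quoted-printable/base64
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return ""


def highlighted_sections(mhtml_text):
    finder = HighlightFinder()
    finder.feed(html_from_mhtml(mhtml_text))
    return finder.found


# Tiny hand-written stand-in for a real Word export.
SAMPLE = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/related; boundary="XX"\n'
    "\n"
    "--XX\n"
    'Content-Type: text/html; charset="utf-8"\n'
    "\n"
    "<html><body><p>See <span style='background:yellow'>Section 4.2</span>"
    " for details.</p></body></html>\n"
    "--XX--\n"
)

print(highlighted_sections(SAMPLE))
```

The state machine, link insertion and anchor steps would build on the positions found here; they depend on the structure of the particular documents, so they are left out of the sketch.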


          • Terry Reedy

            #6
            Re: parsing MS word docs -- tutorial request

            Kay Schluehr wrote:
            > One can convert MS-Word documents into some class of XML documents
            > called MHTML. If I remember correctly those documents had an .mht
            > extension. The result is a huge amount of (nevertheless structured)
            > markup gibberish together with text. If one spends time and attention
            > one can find patterns in the markup (we have XML and it's human
            > readable).

            A related solution is to use OpenOffice to convert to OpenDocument
            Format, a zipped multiple-XML format, and then use ODFPY to
            parse the XML and access the contents as linked objects.
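
As a minimal sketch of what "zipped multiple XML" means in practice, the following reads table cells from an .odt file using only the standard library rather than ODFPY, to avoid guessing at ODFPY's API. It assumes the document has already been converted to .odt and that the body lives in `content.xml` inside the archive; `read_odt_tables` and `make_demo_odt` are hypothetical names, and nested tables are not handled separately.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# OpenDocument XML namespaces
TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def read_odt_tables(source):
    """Return each table in an .odt file as a list of rows of cell strings.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    tables = []
    for table in root.iter("{%s}table" % TABLE_NS):
        rows = []
        for row in table.iter("{%s}table-row" % TABLE_NS):
            cells = []
            for cell in row.iter("{%s}table-cell" % TABLE_NS):
                paragraphs = cell.iter("{%s}p" % TEXT_NS)
                cells.append("\n".join("".join(p.itertext()) for p in paragraphs))
            rows.append(cells)
        tables.append(rows)
    return tables


def make_demo_odt():
    """Build a tiny in-memory archive with one 1x2 table, for demonstration only."""
    xml = (
        '<office:document-content xmlns:office="%s" xmlns:table="%s" xmlns:text="%s">'
        "<office:body><office:text><table:table><table:table-row>"
        "<table:table-cell><text:p>Name</text:p></table:table-cell>"
        "<table:table-cell><text:p>Value</text:p></table:table-cell>"
        "</table:table-row></table:table></office:text></office:body>"
        "</office:document-content>"
    ) % (OFFICE_NS, TABLE_NS, TEXT_NS)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml", xml)
    buf.seek(0)
    return buf


print(read_odt_tables(make_demo_odt()))
```

ODFPY wraps the same XML in Python objects, which is more convenient once the document structure gets more involved than plain tables.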



            • bp.tralfamadore@gmail.com

              #7
              Re: parsing MS word docs -- tutorial request


              Thanks everyone -- very helpful!
              I really appreciate your help -- that is what makes the world a
              wonderful place.

              peace.

              ::bp::
