Hello Python Community,
It'd be great if someone could provide guidance or sample code for
accomplishing the following:
I have a single Unicode file containing descriptions of hundreds of
objects. The file closely resembles the HTML-EXAMPLE pasted below.
I need to parse the file, extract the data from the HTML, and produce
a tab-separated file that looks like the OUTPUT-FILE below.
Any tips, advice, or guidance would be greatly appreciated.
Thanks,
Egon
=====OUTPUT-FILE=====
/please note that the first line of the file contains column headers/
------Tab Separated Output File Begin------
H1	H2	DIV	Segment1	Segment2	Segment3
RoséH1-1	RoséH2-1	RoséDIV-1	RoséSegmentDIV1-1	RoséSegmentDIV2-1	RoséSegmentDIV3-1
PinkH1-2	PinkH2-2	PinkDIV2-2	PinkSegmentDIV1-2	No-Value	No-Value
BlackH1-3	BlackH2-3	BlackDIV2-3	BlackSegmentDIV1-3	No-Value	No-Value
YellowH1-4	YellowH2-4	YellowDIV2-4	YellowSegmentDIV1-4	YellowSegmentDIV2-4	No-Value
------Tab Separated Output File End------
=====HTML-EXAMPLE=====
------HTML Example Begin------
<html>
<h1>RoséH1-1</h1>
<h2>RoséH2-1</h2>
<div>RoséDIV-1</div>
<div "segment1">Rosé SegmentDIV1-1</div><br>
<div "segment2">Rosé SegmentDIV2-1</div><br>
<div "segment3">Rosé SegmentDIV3-1</div><br>
<br>
<br>
<h1>PinkH1-2</h1>
<h2>PinkH2-2</h2>
<div>PinkDIV2-2</div>
<div "segment1">Pink SegmentDIV1-2</div><br>
<br>
<comment></comment>
<h1>BlackH1-3</h1>
<h2>BlackH2-3</h2>
<div>BlackDIV2-3</div>
<div "segment1">BlackSegmentDIV1-3</div><br>
<h1>YellowH1-4</h1>
<h2>YellowH2-4</h2>
<div>YellowDIV2-4</div>
<div "segment1">YellowSegmentDIV1-4</div><br>
<div "segment2">YellowSegmentDIV2-4</div><br>
</html>
------HTML Example End------
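
For illustration, here is a minimal sketch of one way the conversion described above could be approached in Python. It is not a definitive solution and rests on several assumptions: every record starts at an `<h1>`, the plain `<div>` feeds the DIV column, and the `"segmentN"` text inside a `<div ...>` tag selects the Segment column. Because the markup is not strictly valid HTML, the sketch uses regular expressions from the standard library rather than an HTML parser. The names `html_to_rows`, `to_tsv`, `convert` and `SAMPLE` are made up for this example, and the UTF-8 encoding is a guess about the real file.

```python
import re

# Matches <h1>, <h2> and <div ...> elements; captures tag name,
# raw attribute text and the text between the opening and closing tag.
ELEMENT = re.compile(r"<(h1|h2|div)\b([^>]*)>(.*?)</\1>",
                     re.IGNORECASE | re.DOTALL)
# Picks the segment number out of attribute text such as ' "segment2"'.
SEGMENT = re.compile(r"segment(\d+)", re.IGNORECASE)

COLUMNS = ["H1", "H2", "DIV", "Segment1", "Segment2", "Segment3"]
MISSING = "No-Value"


def html_to_rows(text):
    """Return one list of column values per <h1>-delimited record."""
    records = []
    current = None
    for tag, attrs, body in ELEMENT.findall(text):
        tag = tag.lower()
        value = " ".join(body.split())  # collapse newlines and runs of spaces
        if tag == "h1":
            # Assumption: each <h1> starts a new record.
            current = dict.fromkeys(COLUMNS, MISSING)
            current["H1"] = value
            records.append(current)
        elif current is None:
            continue  # ignore anything before the first <h1>
        elif tag == "h2":
            current["H2"] = value
        else:  # a <div>, with or without a segment marker
            seg = SEGMENT.search(attrs)
            key = "Segment" + seg.group(1) if seg else "DIV"
            if key in current:  # segments beyond Segment3 are skipped
                current[key] = value
    return [[rec[col] for col in COLUMNS] for rec in records]


def to_tsv(rows):
    """Join the header and the rows into tab-separated text."""
    lines = ["\t".join(COLUMNS)]
    lines += ["\t".join(row) for row in rows]
    return "\n".join(lines) + "\n"


def convert(src_path, dst_path, encoding="utf-8"):
    """Read the HTML-like file and write the tab-separated file.

    The encoding is an assumption; adjust it to match the real file.
    """
    with open(src_path, encoding=encoding) as src:
        rows = html_to_rows(src.read())
    with open(dst_path, "w", encoding=encoding, newline="") as dst:
        dst.write(to_tsv(rows))


# Abbreviated copy of the example above, used only to try the functions.
SAMPLE = """<html>
<h1>RoséH1-1</h1>
<h2>RoséH2-1</h2>
<div>RoséDIV-1</div>
<div "segment1">Rosé SegmentDIV1-1</div><br>
<div "segment2">Rosé SegmentDIV2-1</div><br>
<h1>PinkH1-2</h1>
<h2>PinkH2-2</h2>
<div>PinkDIV2-2</div>
<div "segment1">Pink SegmentDIV1-2</div><br>
</html>"""
```

Calling `to_tsv(html_to_rows(SAMPLE))` gives a header line followed by one line per record, with `No-Value` in every column that a record does not fill. Text is kept as found in the file apart from collapsing whitespace, so a value such as `Rosé SegmentDIV1-1` keeps its space. For a whole file, `convert("input.html", "output.tsv")` would do the same, where both file names are placeholders.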