Delete paragraphs which do not contain specific word

  • Knut Ole
    New Member
    • Mar 2011
    • 70

    I have a very large number of paragraphs of data, of which only about 10% are to be kept and the rest discarded. Each "entry," i.e. paragraph, has this format:
    Code:
    <parameter> 
    text over several lines sometimes 
    containing key word 
    </parameter>
    I guess it should be possible to find the keyword (or the lack of it), find the previous and next <p...> tags, and delete the paragraph? Is this possible in a /bin/bash script for Linux/Unix?

    Hoping for helpful input! Thank you!
  • jabbah
    New Member
    • Nov 2007
    • 63

    #2
    This feels to me as if it would be tough in bash, but I guess it should be doable in Perl. Just as a rough concept:
    read the file line by line, store the current paragraph in a temporary variable, and check it for the keyword. Once the paragraph has ended, either discard it or print it.
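
The line-by-line idea above can also be sketched in plain POSIX shell rather than Perl. This is only a rough sketch under a few assumptions: every paragraph starts with a line beginning with <parameter> and ends with a line beginning with </parameter>, the keyword is matched as a literal substring (so a keyword that occurs in the tag name itself matches every paragraph), and any text outside those tags is dropped. The function name keep_paragraphs is made up for illustration.

```shell
#!/bin/sh
# Sketch only: keep_paragraphs is a hypothetical helper name.
# Usage: keep_paragraphs KEYWORD < input.txt > output.txt
keep_paragraphs() {
  keyword=$1
  buffer=
  inside=0
  # Build a variable holding a single newline character.
  nl=$(printf '\nx'); nl=${nl%x}
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      '<parameter>'*)
        # Opening tag: start a fresh paragraph buffer.
        inside=1
        buffer=$line
        ;;
      '</parameter>'*)
        # Closing tag: complete the paragraph, then keep or discard it.
        if [ "$inside" -eq 1 ]; then
          buffer="$buffer$nl$line"
          case $buffer in
            *"$keyword"*) printf '%s\n' "$buffer" ;;
          esac
        fi
        inside=0
        buffer=
        ;;
      *)
        # Body line: append it only while inside a paragraph.
        if [ "$inside" -eq 1 ]; then
          buffer="$buffer$nl$line"
        fi
        ;;
    esac
  done
}
```

Something like `keep_paragraphs keyword < data.txt > kept.txt` would then write only the matching paragraphs to kept.txt (both file names are placeholders). On very large files a shell read loop is slow, so the same logic in awk or Perl would likely be faster.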

    • no2pencil
      New Member
      • Mar 2012
      • 4

      #3
      You can use grep to check whether the word is in a file. What I would do is split each paragraph into its own file, and then find only the files you wish to keep.

      This quick script will point out any line that does not start with a letter. Feel free to edit it as need be.

      Code:
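      #!/bin/sh
      
      file=test.txt
      
      # -i ignores case, -n prefixes each result with its line number,
      # -v inverts the match (prints lines that do NOT match).
      # grep -n prints "number:text", so cut keeps only the number.
      grep -inv '^[a-z]' "$file" | cut -d: -f1 |
      while read -r line
      do
        echo "Line number ${line} can be ignored"
      done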
      The options passed to grep are -i to ignore case, -n to prefix each result with its line number, and -v to invert the match. Combined with the ^ anchor, the pattern skips every line that begins with a letter a through z (in either case) and reports only the lines that do not, such as blank lines and the tag lines in the format posted above.

      You should then be able to cat the file, use those line numbers to tell where each paragraph starts and ends, write each paragraph to its own file, and run grep for the keyword over those files. That leaves you with one file for each paragraph that contains your chosen keyword.
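
The split-then-grep idea described above could look roughly like the sketch below. It is not the script from this post: it swaps the line-number bookkeeping for awk, which writes each paragraph straight to its own temporary file, and then uses grep to test each file. It assumes every paragraph starts with a line beginning with <parameter> and ends with a line beginning with </parameter>, and that mktemp -d is available (common on Linux, but not required by POSIX). The name split_and_filter is made up for illustration.

```shell
#!/bin/sh
# Sketch only: split_and_filter is a hypothetical helper name.
# Usage: split_and_filter KEYWORD INPUT_FILE > output.txt
split_and_filter() {
  keyword=$1
  input=$2
  tmpdir=$(mktemp -d) || return 1

  # Write each <parameter>...</parameter> block to its own numbered file.
  awk -v dir="$tmpdir" '
    /^<parameter>/   { n++; out = sprintf("%s/para.%06d", dir, n) }
    out != ""        { print > out }
    /^<\/parameter>/ { close(out); out = "" }
  ' "$input"

  # Keep only the files that contain the keyword.
  # -q suppresses output, -F treats the keyword as a fixed string.
  for f in "$tmpdir"/para.*; do
    [ -f "$f" ] || continue
    if grep -qF -- "$keyword" "$f"; then
      cat "$f"
    fi
  done

  rm -rf "$tmpdir"
}
```

The zero-padded file names keep the paragraphs in their original order when the shell expands the glob. If you want the individual files rather than one combined output, drop the final rm and replace the loop with `grep -lF -- "$keyword" "$tmpdir"/para.*`, which only lists the matching file names.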
