How to tokenize a collection of text files?

  • 29294
    New Member
    • Aug 2013
    • 8

    How to tokenize a collection of text files?

    I am working in the field of Information Retrieval. For this project I need to tokenize a collection of documents such as text files. I have managed to tokenize a string and a single text file, but in the text file I can only tokenize on whitespace, not on hyphens, commas, etc. So I need Java code that splits tokens on characters such as , or - or ' for a collection of text files. pls help pls....
  • chaarmann
    Recognized Expert Contributor
    • Nov 2007
    • 785

    #2
    Just simply replace the whitespace " " in the code with a comma "," etc. to tokenize on other characters.
    But most likely you want all of the mentioned characters together to act as token separators. In that case, tokenize by using a regular expression:
    Code:
    String[] tokens = textString.split("[\\s\\-,']+");
    If you still have problems, then please show the code you have done so far here, so that we can improve and change it.
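
As a rough illustration of the regular-expression approach above, here is a minimal sketch. The class name `SplitSketch`, the helper `tokenize` and the sample sentence are made up for this example and are not from the original poster's code. It strips leading separators first, because `String.split` would otherwise return an empty first element when the text starts with a separator.

```java
import java.util.Arrays;

public class SplitSketch {
    // Separator pattern from the post above: runs of whitespace, '-', ',' or '\''
    static final String SEPARATORS = "[\\s\\-,']+";

    // Hypothetical helper (not part of the original code): returns the tokens of the text
    static String[] tokenize(String text) {
        // Remove leading separators so that split() does not yield an empty first token
        String stripped = text.replaceFirst("^" + SEPARATORS, "");
        if (stripped.isEmpty()) {
            return new String[0];
        }
        // Trailing empty strings are dropped by split() itself
        return stripped.split(SEPARATORS);
    }

    public static void main(String[] args) {
        String sample = "state-of-the-art retrieval, isn't it";
        // prints [state, of, the, art, retrieval, isn, t, it]
        System.out.println(Arrays.toString(tokenize(sample)));
    }
}
```

Note that the apostrophe separator splits "isn't" into "isn" and "t"; whether that is desirable depends on the retrieval task.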

    One tip: to get more and faster answers in general, do NOT write "pls help pls". To me it gives the impression that you are in a hurry (hence the abbreviation) and will not value any answer. It's clear that we are here to help you, but I feel pressured if you mention this self-evident fact and emphasize it with two pleas. You are just lucky that I am in a very good mood right now, otherwise I would not have answered you because of this sentence.


    • 29294
      New Member
      • Aug 2013
      • 8

      #3
      Thank you very much. I am now able to tokenize the text file on whitespace and any other punctuation. Now I want to tokenize a collection of text files, not only a single text file.
      I am attaching my code here. An error occurs and I am unable to find out why.


      • chaarmann
        Recognized Expert Contributor
        • Nov 2007
        • 785

        #4
        Ok, here is the code from the file. I put it here directly using code tags instead of a text file, because then it's easier for others to read, to understand, and to provide help based on line numbers. For the same reason, I cleaned it up by removing commented-out code and indenting it properly.
        Cleaned-up original code:
        Code:
        package stemmer; 
        import java.util.*;  // Provides TreeMap, Iterator, Scanner  
        import java.io.*;    // Provides FileReader, FileNotFoundException  
        
        public class NewEmpty  
        {  
           public static void main(String[ ] args)  
           {  
        		Scanner br;  
        		   
        		//**READ THE DOCUMENTS**  
        		for (int x=0; x<Docs.length; x++)  
        		{  
        			br = new Scanner(new FileReader(Docs[x]));  				 
        		}
        		
        		try
        		{
        			String strLine= " ";
        			String filedata="";
        			while ( (strLine =br.readLine()) != null)
        			{
        				filedata+=strLine+" ";
        			}
        			StringTokenizer stk=new StringTokenizer(filedata," .,-'");
        			while(stk.hasMoreTokens())
        			{
        			   String token=stk.nextToken();
        			   System.out.println(token);
        			}
        			br.close();
        		}  
        		catch (Exception e)
        		{
        			System.err.println("Error: " + e.getMessage());
        		}
        
        		// Array of documents  
        		String Docs [] = {"words.txt", "words2.txt","words3.txt", "words4.txt",};  
        	} 
        }


        • chaarmann
          Recognized Expert Contributor
          • Nov 2007
          • 785

          #5
          First, I wonder how it compiled at all. In line 39 you define the string array of docs that you loop through in line 12, so it must be defined BEFORE line 12.

          Second, you open a text file for reading in your for-loop and assign it to "br", but instead of parsing its content, you close the for-loop. The next iteration then assigns the next text file and so on, until the last text file is assigned, and you parse only this last one. To fix the code, you must do all the parsing (line 17 to 36) inside the for-loop, not outside.

          Third, you have a resource leak. If an exception occurs while reading a file, you don't close the file, leaving it open until the program ends. You must put your close-command in the "finally" part of your try-catch-command. (Unfortunately the close-command can also throw an exception, so it needs a try-catch itself.)

          Fourth, the string array should be named "docs" instead of "Docs". Only class names should start with an uppercase letter; variable names should not. Every professional Java programmer follows this coding style for good reasons, which I will not explain further here, because it would lead too far away.

          There are some other minor issues and possible enhancements, but they don't prevent you from getting it running, so I will not mention them now.

          Here is the corrected source code. (I cannot try to run it at the moment, but you should do it anyway, so tell me if it's ok now.)
          Code:
          package stemmer;
          import java.util.*;  // Provides StringTokenizer
          import java.io.*;    // Provides BufferedReader, FileReader

          public class NewEmpty
          {
              public static void main(String[] args)
              {
                  // Array of documents
                  String docs[] = {"words.txt", "words2.txt", "words3.txt", "words4.txt"};

                  // process all documents
                  for (int x=0; x<docs.length; x++)
                  {
                      // declared outside the try block so that "finally" can close it
                      BufferedReader br = null;
                      try
                      {
                          // read document and parse it
                          // (BufferedReader instead of Scanner, because Scanner has no readLine() method)
                          br = new BufferedReader(new FileReader(docs[x]));
                          String strLine;
                          String filedata="";
                          while ( (strLine =br.readLine()) != null)
                          {
                              filedata+=strLine+" ";
                          }
                          StringTokenizer stk=new StringTokenizer(filedata," .,-'");
                          while(stk.hasMoreTokens())
                          {
                              String token=stk.nextToken();
                              System.out.println(token);
                          }
                      }
                      catch (Exception e)
                      {
                          System.err.println("Error: " + e.getMessage());
                      }
                      finally
                      {
                          try
                          {
                              // br is still null if opening the file failed
                              if (br != null)
                              {
                                  br.close();
                              }
                          }
                          catch (Exception e2)
                          {
                              // NOPMD: nothing useful can be done if closing fails
                          }
                      }
                  }
              }
          }
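
The point above about closing the file in "finally" can also be expressed with try-with-resources (available since Java 7), which closes the reader automatically. The following is only a sketch under assumptions: the class name `TokenizeFilesSketch`, the helper `tokenizeFile` and the UTF-8 encoding are choices made for this example and do not come from the original code. The file names and the delimiter string are taken from the thread.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeFilesSketch {
    // Hypothetical helper: reads one file and returns its tokens.
    // Assumes the file is UTF-8 encoded (an assumption, not stated in the thread).
    static List<String> tokenizeFile(Path file) throws IOException {
        List<String> tokens = new ArrayList<>();
        // try-with-resources closes the reader even if readLine() throws
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Same delimiters as in the code above: space . , - '
                StringTokenizer stk = new StringTokenizer(line, " .,-'");
                while (stk.hasMoreTokens()) {
                    tokens.add(stk.nextToken());
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] docs = {"words.txt", "words2.txt", "words3.txt", "words4.txt"};
        for (String doc : docs) {
            try {
                for (String token : tokenizeFile(Paths.get(doc))) {
                    System.out.println(token);
                }
            } catch (IOException e) {
                // Report the problem and continue with the next document
                System.err.println("Error reading " + doc + ": " + e.getMessage());
            }
        }
    }
}
```

Tokenizing line by line gives the same tokens as concatenating the lines with spaces first, since the space is itself a delimiter, and it avoids building one large string per file.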


          • Mousumi Dhar
            New Member
            • Aug 2013
            • 1

            #6
            The following code is working fine in NetBeans for the above problem :D


            Code:
            package FinalizedPrograms;
            import java.io.*;    // Provides BufferedReader, File, FileReader, FileNotFoundException
            import java.util.*;  // Provides StringTokenizer

            public class TokenizingMultipleFiles
            {
                public static void main(String[] args)
                {
                    // Array of documents
                    String Docs[] = {"temp.txt", "temp1.txt",};
                    // Loop over the documents
                    for (int x=0; x<Docs.length; x++)
                    {
                        try
                        {
                            File f=new File(Docs[x]);
                            BufferedReader br = new BufferedReader(new FileReader(f));
                            try
                            {
                                String strLine= " ";
                                String filedata="";
                                while ( (strLine = br.readLine()) != null)
                                {
                                    filedata+=strLine+" ";
                                }
                                StringTokenizer stk=new StringTokenizer(filedata," .,-'[]{}/|@#!$%^&*_-+=?<>:;()");
                                while(stk.hasMoreTokens())
                                {
                                    String token=stk.nextToken();
                                    System.out.println(token);
                                }
                                br.close();
                            }
                            catch (Exception e)
                            {
                                System.err.println("Error: " + e.getMessage());
                            }
                        }
                        catch (FileNotFoundException e)
                        {
                            System.err.println(e);
                            return;
                        }
                    } // End of for loop
                }
            }
            Last edited by Mousumi Dhar; Aug 14 '13, 11:11 AM. Reason: Wrong code attached by mistake. Sorry!


            • 29294
              New Member
              • Aug 2013
              • 8

              #7
              Code:
              package IR;
              import java.io.*;    // Provides BufferedReader, File, FileReader, FileNotFoundException
              import java.util.*;  // Provides StringTokenizer

              public class FilesTokenization
              {
                  public static void main(String[] args)
                  {
                      // Array of documents
                      String Docs[] = {"words.txt", "words2.txt","words3.txt", "words4.txt",};
                      // start for loop
                      for (int x=0; x<Docs.length; x++)
                      {
                          try
                          {
                              File f=new File(Docs[x]);
                              BufferedReader br = new BufferedReader(new FileReader(f));
                              try
                              {
                                  String strLine= " ";
                                  String filedata="";
                                  while ( (strLine = br.readLine()) != null)
                                  {
                                      filedata+=strLine+" ";
                                  }
                                  StringTokenizer stk=new StringTokenizer(filedata," .,-';{}?()");
                                  while(stk.hasMoreTokens())
                                  {
                                      String token=stk.nextToken();
                                      System.out.println(token);
                                  }
                                  br.close();
                              }
                              catch (FileNotFoundException e)
                              {
                                  System.err.println(e);
                                  return;
                              }
                          }
                          catch (Exception e)
                          {
                              System.err.println("Error: " + e.getMessage());
                          }
                      }
                  }
              }
              Last edited by Rabbit; Aug 14 '13, 03:30 PM. Reason: Please use code tags when posting code or formatted data.


              • 29294
                New Member
                • Aug 2013
                • 8

                #8
                Thank you very much, chaarmann, for pointing out the faults. I am now able to run the program successfully and it gives the correct output. The code is in my post above.


                • devkumarOO7
                  New Member
                  • Aug 2013
                  • 2

                  #9
                  This content is very helpful for me.
                  Thanks for providing it.


                  • vinaykumar1994
                    New Member
                    • Jan 2015
                    • 2

                    #10
                    Hi. I am currently doing a project on personalized web search, which is related to information retrieval concepts like stemming and tokenization. Can anyone help me by providing related code for my project?


                    • vinaykumar1994
                      New Member
                      • Jan 2015
                      • 2

                      #11
                      Please mail the code for tokenization to me. My mail id: mailmevinay1994 @gmail.com
