Slow String operations...

  • Mugunth

    Slow String operations...

    I'm writing a search engine crawler in C# for indexing local files.
    My dataset is about 38000 XML files and, as of now, I've successfully
    parsed and tokenized them.
    But it's surprising to find that string operations are gradually
    becoming slower...
    The system crunches 8200 files in the first 10 seconds, but is able to
    do only 5000 in the next 10, and then 3500 in the next 10, and it
    reduces gradually...
    It takes about 75 seconds in total for 38000 files, whereas if the
    system had proceeded at the speed with which it started, it should
    have taken under 50 seconds...
    Why do string operations become progressively slower?

    This is my output...
    Total files processed so far: 8201
    Time taken so far (sec):10.001
    Total files processed so far: 13106
    Time taken so far (sec):20.002
    Total files processed so far: 17661
    Time taken so far (sec):30.001
    Total files processed so far: 21926
    Time taken so far (sec):40.002
    Total files processed so far: 26489
    Time taken so far (sec):50.018
    Total files processed so far: 30703
    Time taken so far (sec):60.002
    Total files processed so far: 35479
    Time taken so far (sec):70.017
    Done - 37526 files found!
    Time taken so far (sec):74.883


    Any help appreciated...
    Mugunth
  • Mugunth

    #2
    Re: Slow String operations...


    Thank you for your answers...
    The calls to Tokenize and StripPunctuations are the string operations.
    In the first 10 seconds, they usually tokenize about 8200 files.
    In the second 10 seconds they tokenize only 5000 files...
    and in the third 10 seconds, it's even fewer...
    Nearly all the files are the same size... but the algorithm gets
    progressively slower with time...


    This is my strip punctuation code

    char[] punctuations = { '#', '!', '*', '-', '"', ',' };
    int len = sbFileContents.Length;
    for (int i = 0; i < len; i++)
    {
        if (sbFileContents[i].CompareTo(punctuations[0]) == 0 ||
            sbFileContents[i].CompareTo(punctuations[1]) == 0 ||
            sbFileContents[i].CompareTo(punctuations[2]) == 0 ||
            sbFileContents[i].CompareTo(punctuations[3]) == 0 ||
            sbFileContents[i].CompareTo(punctuations[4]) == 0 ||
            sbFileContents[i].CompareTo(punctuations[5]) == 0)
        {
            sbFileContents[i] = ' ';
        }
    }
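
    For comparison, the same replacement could be written with a set lookup instead of six chained CompareTo calls. This is only a sketch: it assumes sbFileContents is a StringBuilder (its declaration isn't shown in the post), and PunctuationStripper and Strip are made-up names.

    ```csharp
    using System.Collections.Generic;
    using System.Text;

    static class PunctuationStripper
    {
        // The same six characters as in the posted code.
        private static readonly HashSet<char> Punctuation =
            new HashSet<char> { '#', '!', '*', '-', '"', ',' };

        // Replaces every punctuation character with a space, in place.
        public static void Strip(StringBuilder sb)
        {
            for (int i = 0; i < sb.Length; i++)
            {
                if (Punctuation.Contains(sb[i]))
                {
                    sb[i] = ' ';
                }
            }
        }
    }
    ```

    Calling PunctuationStripper.Strip(sbFileContents) would take the place of the loop above; whether it is measurably faster would have to be timed.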

    this is my tokenize code...
    string[] returnArray;
    string[] delimiters = { " ", "?", ". " };
    int count = 0;
    string[] strArray = fileContents.ToString().Split(
        delimiters, StringSplitOptions.RemoveEmptyEntries);

    returnArray = new string[strArray.Length];

    PorterStemmer ps = new PorterStemmer();
    foreach (String str in strArray)
    {
        string word;
        if (bStem)
        {
            word = ps.stemTerm(str);
        }
        else
        {
            word = str;
        }

        if (!IsStopWord(word))
            returnArray[count++] = word;
    }

    return returnArray;
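
    One thing worth noting in the tokenize code: returnArray is allocated at the full length of strArray, so every skipped stop word leaves a null slot at the end of the array. A sketch of one way around that, collecting into a List<string>. The stem and isStopWord delegates are stand-ins for ps.stemTerm and IsStopWord, whose implementations aren't shown in the thread, and Tokenizer is a made-up name.

    ```csharp
    using System;
    using System.Collections.Generic;

    static class Tokenizer
    {
        private static readonly string[] Delimiters = { " ", "?", ". " };

        // stem may be null (no stemming); isStopWord decides which words to drop.
        public static string[] Tokenize(
            string text, Func<string, string> stem, Func<string, bool> isStopWord)
        {
            string[] parts = text.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);
            var result = new List<string>(parts.Length);
            foreach (string part in parts)
            {
                string word = stem != null ? stem(part) : part;
                if (!isStopWord(word))
                {
                    result.Add(word);
                }
            }
            // The array is exactly as long as the number of kept words.
            return result.ToArray();
        }
    }
    ```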


    Is it that, as time progresses, the number of garbage collections
    goes up, and that overhead hampers my performance over time?
    Is there any way to set the size of the heap at program start?
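
    One way to test the garbage collection theory would be to log how many collections happened in each 10-second interval next to the file count. A rough sketch using GC.CollectionCount; GcProbe and Delta are made-up names, and this only counts collections, it does not measure how long they took.

    ```csharp
    using System;

    sealed class GcProbe
    {
        // Last observed collection count, one slot per GC generation.
        private readonly int[] last = new int[GC.MaxGeneration + 1];

        public GcProbe()
        {
            Delta(); // record the baseline
        }

        // Returns the number of collections per generation since the previous call.
        public int[] Delta()
        {
            var delta = new int[last.Length];
            for (int gen = 0; gen < last.Length; gen++)
            {
                int now = GC.CollectionCount(gen);
                delta[gen] = now - last[gen];
                last[gen] = now;
            }
            return delta;
        }
    }
    ```

    If the per-interval counts stay roughly flat while throughput drops, the GC is probably not the cause.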

    Regards,
    Mugunth


    • Bill Butler

      #3
      Re: Slow String operations...


      "Mugunth" <mugunth.kumar@gmail.com> wrote in message news:64c0c16d-ff37-47a3-8c39-9ca44c2356b1@d21g2000prf.googlegroups.com...
      > I'm writing a search engine crawler for indexing local files in C#
      > My dataset is about 38000 XML files and as of now, I've successfully
      > parsed the file, and tokenized it.
      > But, it's surprising to find that, string operations gradually
      > becoming slower...
      > The system crunches 8200 files in the first 10 seconds, but is able to
      > do only 5000 in the next 10, and then 3500 in the next 10 and it
      > reduces gradually...

      I suggest that you recheck your math.
      From your output I get the following:

      time   total   diff
      10      8201   8201
      20     13106   4905
      30     17661   4555
      40     21926   4265
      50     26489   4563
      60     30703   4214
      70     35479   4776

      Apart from the first data point, it looks quite linear.
      If you calculate the number of files processed in each 10-second interval, it ranges from ~4200 to ~4900 with no noticeable drop-off.

      I am not sure why the first interval was so much faster, but this is not slowing to a crawl.
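
      The per-interval figures can be reproduced from the cumulative counts in the posted output. A small sketch; Throughput and PerInterval are made-up names, and the numbers are copied from the output in the original post.

      ```csharp
      using System;

      static class Throughput
      {
          // Cumulative file counts from the posted output, one per 10-second mark.
          public static readonly int[] Cumulative =
              { 8201, 13106, 17661, 21926, 26489, 30703, 35479 };

          // Converts cumulative totals into files processed per interval.
          public static int[] PerInterval(int[] cumulative)
          {
              var diffs = new int[cumulative.Length];
              int previous = 0;
              for (int i = 0; i < cumulative.Length; i++)
              {
                  diffs[i] = cumulative[i] - previous;
                  previous = cumulative[i];
              }
              return diffs;
          }
      }
      ```

      Throughput.PerInterval(Throughput.Cumulative) gives 8201, 4905, 4555, 4265, 4563, 4214, 4776.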



      • Jon Skeet [C# MVP]

        #4
        Re: Slow String operations...

        On Feb 8, 1:15 pm, Mugunth <mugunth.ku...@gmail.com> wrote:

        <snip>
        > Is it like, as time progresses, the number of Garbage collection calls
        > are higher and because of that overhead my performance is hampered
        > over time?

        Possible, but I wouldn't expect that to be the problem.

        Again though, if you could produce a *complete* program it would make
        life a lot easier.
        It doesn't need to look at different files - just going through the
        same file thousands of times should demonstrate the issue given what
        you've been saying.
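
        A sketch of what such a test harness might look like: it runs the same content through a processing step many times and records the elapsed time per batch, so any slowdown cannot be blamed on differing data. RepeatBenchmark, Run and the process delegate are made-up names; the delegate stands in for the poster's strip-and-tokenize step.

        ```csharp
        using System;
        using System.Diagnostics;

        static class RepeatBenchmark
        {
            // Calls process(content) `iterations` times and returns the
            // elapsed milliseconds for each batch of `batchSize` calls.
            public static long[] Run(
                string content, Action<string> process, int iterations, int batchSize)
            {
                int batches = (iterations + batchSize - 1) / batchSize;
                var elapsed = new long[batches];
                var sw = new Stopwatch();
                for (int b = 0; b < batches; b++)
                {
                    int n = Math.Min(batchSize, iterations - b * batchSize);
                    sw.Restart();
                    for (int i = 0; i < n; i++)
                    {
                        process(content);
                    }
                    sw.Stop();
                    elapsed[b] = sw.ElapsedMilliseconds;
                }
                return elapsed;
            }
        }
        ```

        If later batches take clearly longer than earlier ones on identical input, the degradation is real.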

        Jon


        • Family Tree Mike

          #5
          RE: Slow String operations...

          Bill Butler's answer makes a good point.

          Do the file sizes vary wildly, or are they approximately the same across
          the sample? That could account for the differences. Also, look at your
          process. If some files need more replacements than others, then file
          count is not a proper measure of the work done in each 10 seconds.
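
          One way to check this would be to count characters as well as files, so that each interval's report reflects the amount of text handled. A minimal sketch; WorkCounter and its members are made-up names.

          ```csharp
          sealed class WorkCounter
          {
              public int Files { get; private set; }
              public long Characters { get; private set; }

              // Call once per processed file with its contents.
              public void Add(string fileContents)
              {
                  Files++;
                  Characters += fileContents.Length;
              }

              // Average file size seen so far, in characters.
              public double AverageCharactersPerFile
              {
                  get { return Files == 0 ? 0.0 : (double)Characters / Files; }
              }
          }
          ```

          Printing Characters alongside Files every 10 seconds would show whether the slower intervals simply contained bigger files.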


          • Jon Skeet [C# MVP]

            #6
            Re: Slow String operations...

            On Feb 8, 4:06 pm, Mugunth <mugunth.ku...@gmail.com> wrote:

            <snip>
            > This is the complete source code..

            That's very helpful.

            > But dataset is huge.. which I cannot upload...

            Could you give us a single sample file to load 38000 times though?
            Is each file big?

            Jon


            • Mugunth

              #7
              Re: Slow String operations...

              <DOC>
              <DOCNO> ABC19981002.1830.0000 </DOCNO>
              <DOCTYPE> MISCELLANEOUS </DOCTYPE>
              <TXTTYPE> CAPTION </TXTTYPE>
              <TEXT>
              The troubling connections between the global economic crisis and
              American
              jobs. monica Lewinsky and Linda Tripp, private conversations made
              public. Gene autry had died, the most famous singing cowboy of them
              all. And the artist who sold only one painting in his lifetime and
              is an icon today.
              </TEXT>
              </DOC>

              this is one single file...


              • Jon Skeet [C# MVP]

                #8
                Re: Slow String operations...

                On Feb 8, 4:17 pm, Mugunth <mugunth.ku...@gmail.com> wrote:

                <snip>
                > Again, the first second it can parse 3200 files...
                > the last 5 seconds (30-34) it could parse only 3500 files...
                > My data set is not this disparate...

                I would strongly suggest that you modify your code to load a single
                file thousands of times. That way you *know* whether the performance
                is actually degrading or whether it's just different data.

                Jon


                • Creativ

                  #9
                  Re: Slow String operations...

                  Just comment out, one at a time, the parts that are likely to cost
                  the most time. You might find the culprit.
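
                  An alternative to commenting code out is to time each stage separately and compare the totals. This is a sketch of that different technique, not what was suggested above; StageTimer and the stage names are made up.

                  ```csharp
                  using System;
                  using System.Collections.Generic;
                  using System.Diagnostics;

                  sealed class StageTimer
                  {
                      private readonly Dictionary<string, Stopwatch> timers =
                          new Dictionary<string, Stopwatch>();

                      // Runs `work` and adds its duration to the named stage's total.
                      public void Time(string stage, Action work)
                      {
                          Stopwatch sw;
                          if (!timers.TryGetValue(stage, out sw))
                          {
                              sw = new Stopwatch();
                              timers[stage] = sw;
                          }
                          sw.Start();
                          try { work(); }
                          finally { sw.Stop(); }
                      }

                      // Total time accumulated for a stage (zero if never timed).
                      public TimeSpan Elapsed(string stage)
                      {
                          Stopwatch sw;
                          return timers.TryGetValue(stage, out sw) ? sw.Elapsed : TimeSpan.Zero;
                      }
                  }
                  ```

                  Wrapping the parse, strip and tokenize calls as timer.Time("tokenize", () => ...) and printing each Elapsed value per interval would show which stage, if any, is growing.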
