Bizarre benchmark result -- C# hundreds of times slower than Java?

  • Michael A. Covington

    Bizarre benchmark result -- C# hundreds of times slower than Java?

    While asking some Java enthusiasts what they think about C#, I came across
    this:



    Reportedly, the (essentially) same program in C# is much, much slower than
    in Java.

    This is a program that is heavy on regexes (which I'm not an expert on), and
    I am wondering if the C# version makes an elementary blunder. Do any experts
    want to have a look? See also comp.lang.java.

    (Query: Is he compiling the regex once in Java, but every time through the
    loop in C#?)




  • Shalin Shah

    #2
    Re: Bizarre benchmark result -- C# hundreds of times slower than Java?

    > This is a program that is heavy on regexes (which I'm not an expert on) and
    > am wondering if the C# version makes an elementary blunder.  Do any experts
    > want to have a look?  See also comp.lang.java.
    >
    > (Query: Is he compiling the regex once in Java, but every time through the
    > loop in C#?)

    I think he is compiling the regular expression each time through the
    loop. A good benchmark would compile it once and match it in the
    loop. Maybe C# uses a DFA-NFA hybrid (which might explain the large
    memory usage that the author of the article claims). Such an engine
    can match a regular expression several times faster than a
    backtracking implementation, but it compiles the expression more
    slowly.
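
    A minimal sketch of the two shapes being compared here. The benchmark's own code is not reproduced in this thread, so the class name, method names, pattern and input are all hypothetical; only the placement of `new Regex(...)` matters.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexHoisting
{
    // Constructs a new Regex on every iteration, so the pattern is
    // parsed again for each line.
    public static int CountMatchesRebuilding(IEnumerable<string> lines, string pattern)
    {
        int count = 0;
        foreach (string line in lines)
        {
            Regex regex = new Regex(pattern);
            if (regex.IsMatch(line)) count++;
        }
        return count;
    }

    // Constructs the Regex once and reuses it for every line.
    public static int CountMatchesHoisted(IEnumerable<string> lines, string pattern)
    {
        Regex regex = new Regex(pattern);
        int count = 0;
        foreach (string line in lines)
        {
            if (regex.IsMatch(line)) count++;
        }
        return count;
    }
}
```

    Both methods return the same count; any difference is in how much construction work is repeated, which is the thing a fair benchmark would keep out of the timed loop.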

    Another good benchmark would include the use of backreferences, which
    force a regex implementation to use backtracking.
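
    For readers unfamiliar with the term, here is a small sketch of a pattern that uses a backreference. The pattern and the names are illustrative assumptions, not taken from the benchmark: it looks for any three bases immediately followed by the same three bases.

```csharp
using System.Text.RegularExpressions;

public static class TandemRepeat
{
    // ([ACGT]{3}) captures any three bases; \1 then requires exactly
    // the same three bases to follow immediately.
    private static readonly Regex Repeated = new Regex(@"([ACGT]{3})\1");

    public static bool HasRepeatedTriplet(string sequence)
    {
        return Repeated.IsMatch(sequence);
    }
}
```

    What `\1` must match depends on what the group captured earlier in the same attempt, which is why such patterns fall outside what a pure DFA can express.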

    FYI, egrep uses a DFA-NFA hybrid.


    • Ethan Strauss

      #3
      Re: Bizarre benchmark result -- C# hundreds of times slower than J

      Thanks for the links! Very helpful and interesting.
      Ethan

      > Plenty, usually caused by the fact that regexes aren't regular expressions
      > (the theoretical constructs, which always match in linear time). See, e.g.,
      > http://www.codinghorror.com/blog/archives/000488.html and
      > http://www.regular-expressions.info/catastrophic.html.
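
      One hedged way to keep a runaway backtracking match from stalling a program is the `Regex` constructor overload that takes a match timeout. The wrapper below is a sketch with hypothetical names; it returns null when the engine gives up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GuardedMatch
{
    // Returns true or false for an ordinary match result, or null if
    // the engine exceeded the timeout and abandoned the match.
    public static bool? IsMatchWithin(string pattern, string input, TimeSpan timeout)
    {
        try
        {
            Regex regex = new Regex(pattern, RegexOptions.None, timeout);
            return regex.IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            return null;
        }
    }
}
```

      The shape the linked articles describe is a nested quantifier such as `(a+)+$` run against a long string of `a` characters ending in something else. How long that takes depends on the engine and its version, so no timing is claimed here.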


      • Ethan Strauss

        #4
        Re: Bizarre benchmark result -- C# hundreds of times slower than J

        I don't want to be the one making Jesus cry, but I am not sure that my code
        is what is doing it. I see that representing DNA as strings is not going to
        make the CPU as happy as it could be, but it is not obvious to me how to
        represent DNA (and RNA and protein, so I can't use a byte array anymore) as a
        numeric array and still get relatively programmer-friendly functionality.
        I started yesterday (code below...) and stopped pretty rapidly because I
        don't see a way to recreate IndexOf- or Regex-type functionality without a
        lot of work! If you have anything more complete, I would be interested!
        Thanks,
        Ethan

        using System;
        using System.Collections.Generic;
        using System.Text;

        namespace TestSequence
        {
            public struct DNA
            {
                private DNABase[] _Sequence;

                // Builds the sequence from text, skipping any character
                // that is not G, A, T or C.
                public DNA(string sequence)
                {
                    List<DNABase> ThisSequence = new List<DNABase>();
                    foreach (char thisBase in sequence.ToUpper().ToCharArray())
                    {
                        DNABase NextBase;
                        // thisBase is a char, so the case labels must be
                        // char literals ('G'), not strings ("G").
                        switch (thisBase)
                        {
                            case 'G':
                                NextBase = DNABase.G;
                                break;
                            case 'A':
                                NextBase = DNABase.A;
                                break;
                            case 'T':
                                NextBase = DNABase.T;
                                break;
                            case 'C':
                                NextBase = DNABase.C;
                                break;
                            default:
                                continue;   // move on to the next character
                        }
                        ThisSequence.Add(NextBase);
                    }
                    _Sequence = ThisSequence.ToArray();
                }
            }

            public enum DNABase : byte
            {
                N = 0,
                G = 1,
                A = 2,
                T = 3,
                C = 4
            }
        }
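
        As a rough illustration of the IndexOf-type functionality mentioned above, here is a naive search over a `DNABase[]`. This is a sketch under assumptions: `DNASearch` and its method are hypothetical names, the enum is repeated only so the block compiles on its own, and the straightforward O(n*m) scan is just one possible approach.

```csharp
using System;

// Repeated from the post above so the sketch compiles on its own.
public enum DNABase : byte { N = 0, G = 1, A = 2, T = 3, C = 4 }

public static class DNASearch
{
    // Returns the index of the first occurrence of motif in sequence,
    // or -1 if it does not occur (the same convention as string.IndexOf).
    public static int IndexOf(DNABase[] sequence, DNABase[] motif)
    {
        if (sequence == null) throw new ArgumentNullException(nameof(sequence));
        if (motif == null) throw new ArgumentNullException(nameof(motif));
        if (motif.Length == 0) return 0;

        for (int start = 0; start <= sequence.Length - motif.Length; start++)
        {
            int offset = 0;
            while (offset < motif.Length && sequence[start + offset] == motif[offset])
            {
                offset++;
            }
            if (offset == motif.Length) return start;
        }
        return -1;
    }
}
```

        Regex-style matching over such an array would indeed take considerably more work than this.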


        > And I hope you realise that using strings to represent DNA sequences
        > except for input/output makes baby Jesus cry. Your programs would be
        > faster and more maintainable using your own type (probably backed by a
        > byte[] and some unsafe code). The best way to represent it normally
        > depends on what you're doing and whether you need to consider SNPs, but
        > Unicode strings are a bad idea!
        >
        > Alun Harford


        • Jesse Houwing

          #5
          Re: Bizarre benchmark result -- C# hundreds of times slower than Java?

          Hello Michael,
          > While asking some Java enthusiasts what they think about C#, I came
          > across this:
          >
          > h_cameron
          >
          > Reportedly, the (essentially) same program in C# is much, much slower
          > than in Java.
          >
          > This is a program that is heavy on regexes (which I'm not an expert
          > on) and am wondering if the C# version makes an elementary blunder.
          > Do any experts want to have a look? See also comp.lang.java.
          >
          > (Query: Is he compiling the regex once in Java, but every time through
          > the loop in C#?)

          Replacing
          Regex regexpr = new Regex(matchthis, RegexOptions.Compiled);

          with
          Regex regexpr = new Regex(matchthis, RegexOptions.None);

          made it fly.
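
          To check a change like this on your own machine, one option is to time Regex construction under each option. The helper below is a sketch: the class name, the pattern you pass in and the iteration count are assumptions, and no particular timings are claimed.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class RegexOptionTiming
{
    // Measures how long it takes to construct `count` Regex objects
    // from the same pattern with the given options.
    public static TimeSpan TimeConstruction(string pattern, RegexOptions options, int count)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < count; i++)
        {
            Regex regex = new Regex(pattern, options);
            GC.KeepAlive(regex);   // keep the object referenced inside the loop
        }
        watch.Stop();
        return watch.Elapsed;
    }
}
```

          Calling it once with `RegexOptions.Compiled` and once with `RegexOptions.None` gives two figures to compare. If the benchmark constructs its Regex inside the loop, the construction cost is what dominates, rather than the matching speed.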

          --
          Jesse Houwing
          jesse.houwing at sogeti.nl

