string dictionary and memory issue.

  • Alexandre Brisebois (www.pointnetsolutions.com)

    string dictionary and memory issue.

    Hi,

    I am currently building a lexical analysis component to pull keywords
    out of content. I have a functional first build, but I am having
    problems, since I easily load over 300,000 strings into memory.

    When I do the actual analysis, I can reach up to 400 MB of RAM usage.

    I built my dictionary as a tree of nodes containing hashtables. Each
    node represents a letter of the string, plus a flag marking the end of
    a string.

    I chose this structure for speed. The reading and loading speed of the
    tree is remarkably fast, but the cost to the system is greater than I
    had expected.

    Could anyone suggest a solution to this? Should I store my node data
    as compressed data?
    Best Regards,
    Alexandre Brisebois
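
    (As a rough illustration, the node-and-hashtable structure described
    above might look like the following C# sketch; the class and member
    names are hypothetical, not the poster's actual code.)

```csharp
using System.Collections;

// Hypothetical sketch of the structure described above: a trie whose
// nodes hold a Hashtable of child nodes keyed by character, plus a flag
// marking the end of a word.
class TrieNode
{
    public readonly Hashtable Children = new Hashtable(); // char -> TrieNode
    public bool IsEndOfWord;
}

class Trie
{
    private readonly TrieNode root = new TrieNode();

    public void Add(string word)
    {
        TrieNode node = root;
        foreach (char c in word)
        {
            TrieNode child = (TrieNode)node.Children[c];
            if (child == null)
            {
                child = new TrieNode();
                node.Children[c] = child;
            }
            node = child;
        }
        node.IsEndOfWord = true;
    }

    public bool Contains(string word)
    {
        TrieNode node = root;
        foreach (char c in word)
        {
            node = (TrieNode)node.Children[c];
            if (node == null)
                return false;
        }
        return node.IsEndOfWord;
    }
}
```

    (Each Hashtable entry carries boxing and bucket overhead per child,
    which is one plausible source of the memory figures discussed below.)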

  • Jon Skeet [C# MVP]

    #2
    Re: string dictionary and memory issue.

    Alexandre Brisebois (www.pointnetsolutions.com) wrote:
    > I am currently building a lexical analysis component to pull keywords
    > out of content. I have a functional first build, but I am having
    > problems, since I easily load over 300,000 strings into memory.
    >
    > When I do the actual analysis, I can reach up to 400 MB of RAM usage.
    >
    > I built my dictionary as a tree of nodes containing hashtables. Each
    > node represents a letter of the string, plus a flag marking the end
    > of a string.

    How many entries are in each node? I wouldn't expect there'd be that
    many - particularly if you're only dealing with ASCII characters. I'd
    suggest using a plain array, either as a list of entries (for nodes
    without many sub-entries) or an array with nulls in (for those where
    traversing a list would be expensive). You could use ArrayList/List<T>
    instead of the straight arrays, but call TrimToSize on all of them when
    you've finished loading the list of words to avoid wastage.

    Jon
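
    (For illustration, the array-based node suggested above could be
    sketched as follows for lowercase ASCII input; the names are
    hypothetical. A fixed 26-slot array per node trades some null slots
    for the removal of Hashtable's per-entry overhead.)

```csharp
// Hypothetical sketch of an array-based trie node, assuming lowercase
// ASCII ('a'..'z') input: a fixed 26-slot child array replaces the
// per-node Hashtable.
class ArrayTrieNode
{
    public readonly ArrayTrieNode[] Children = new ArrayTrieNode[26];
    public bool IsEndOfWord;

    // Returns the child for c, or null if none exists.
    public ArrayTrieNode GetChild(char c)
    {
        return Children[c - 'a'];
    }

    // Returns the child for c, creating it if necessary.
    public ArrayTrieNode GetOrAddChild(char c)
    {
        int i = c - 'a';
        if (Children[i] == null)
            Children[i] = new ArrayTrieNode();
        return Children[i];
    }
}
```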


    • Alexandre Brisebois (www.pointnetsolutions.com)

      #3
      Re: string dictionary and memory issue.

      Well, each node has between zero and the full range of possible
      characters.

      I tried loading ~117,000 words into a straightforward hashtable;
      a 4 MB XML file became 25 MB in memory...

      I still find that extremely large. I am trying to find a way to use
      as little RAM as possible without completely hurting my speed.

      I will try the ArrayList and generic List<T> collections.
      If you have any other ideas, please do let me know.

      Best Regards,
      Alexandre Brisebois


      • Jon Skeet [C# MVP]

        #4
        Re: string dictionary and memory issue.

        Alexandre Brisebois (www.pointnetsolutions.com) wrote:
        > Well, each node has between zero and the full range of possible
        > characters.
        >
        > I tried loading ~117,000 words into a straightforward hashtable;
        > a 4 MB XML file became 25 MB in memory...

        Is that the total memory of the process? If so, that's not particularly
        surprising. The framework creates a fair amount of overhead (try
        loading a virtually empty file to see what I mean) and the strings will
        all be converted to Unicode, which will double the size of an ASCII
        file.

        > I still find that extremely large. I am trying to find a way to
        > use as little RAM as possible without completely hurting my speed.

        Do you actually have a concrete requirement with respect to memory? On
        most modern systems, 25MB is very little. One can waste a lot of time
        going for the "best possible" performance instead of performance which
        is "good enough".

        Jon
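
        (One way to separate managed-heap growth from general process
        overhead is sketched below, with a hypothetical loader standing in
        for the real XML code. Figures from Task Manager include the
        runtime itself, while GC.GetTotalMemory reports only managed
        allocations.)

```csharp
using System;

// Hypothetical sketch: measure how much the managed heap grows when the
// word list is loaded, as opposed to the total process working set.
class MemoryCheck
{
    public static long MeasureLoadGrowth()
    {
        long before = GC.GetTotalMemory(true); // force a full collection first
        string[] words = LoadWords();          // hypothetical loader
        long after = GC.GetTotalMemory(true);
        GC.KeepAlive(words);                   // keep the data alive across the measurement
        return after - before;
    }

    // Placeholder standing in for the real XML loading code.
    static string[] LoadWords()
    {
        return new string[] { "example", "words" };
    }
}
```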


        • Alexandre Brisebois (www.pointnetsolutions.com)

          #5
          Re: string dictionary and memory issue.

          Well, I want to try to have this run on a shared hosting machine,
          so I do wish to find a way to reduce the amount of memory used.
          Though the more I look into this, the more I understand that I
          will have to build a distributed system.

          I'm still not sure how I'm going to have to structure it all.

          So far my dictionary files contain roughly 200,000 words, and
          growing. I also built a Thesaurus ADT, so then we are talking
          about another easy 200,000 words (interlinked by references, not
          copies),

          and the system continuously identifies unknown words which need
          to be reviewed.

          So all this takes up a lot of resources as it is loaded.

          I have never done lexical analysis before, so this stuff is all
          new to me and I have not found the best strategy so far, so I
          keep looking and trying different things.

          Using a database would make everything much simpler,
          but I don't have that luxury.

          That is the main reason I am looking for some type of strategy
          to build either partial or full thesaurus and dictionary ADTs.
          I am currently looking at the partial-loading option, where I
          would load into a graph only the words needed
          for a particular analysis, and unload them afterwards.

          I am also looking at possibly merging the thesaurus with the
          dictionary, but I am still trying to find anything that would
          prevent me from doing so.

          Best Regards,
          Alexandre Brisebois
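
          (The partial-loading option might be sketched like this; the
          class name and the one-word-per-line file format are
          hypothetical, not the poster's actual XML data files.)

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch of the partial-loading option: instead of keeping
// the full dictionary resident, load only the entries needed for one
// analysis and let them be collected afterwards. Assumes a plain-text
// file with one word per line.
class PartialDictionary
{
    public static Dictionary<string, bool> LoadSubset(string path, ICollection<string> needed)
    {
        Dictionary<string, bool> subset = new Dictionary<string, bool>();
        using (StreamReader reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (needed.Contains(line))
                    subset[line] = true;
            }
        }
        // The caller drops the subset after the analysis, so the GC can
        // reclaim it.
        return subset;
    }
}
```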
