Quickly finding duplicates in an ArrayList

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • paradox

    Quickly finding duplicates in an ArrayList

    Basically I have an ArrayList of strings; I need a fast way of
    searching through and comparing the elements of the ArrayList and the
    return an ArrayList of items that have 3 Duplicates.

    For example, if the ArrayList has the following elements:

    elems[0] = "blue"
    elems[1] = "red"
    elems[2] = "blue"
    elems[3] = "green"
    elems[4] = "red"
    elems[5] = "red"

    I want to have a function that goes through the list and then returns a
    new arraylist like:

    newElems[0] = "red"
    newElems[1] = "red"
    newElems[2] = "red"

    In actuality my arraylist is an array of objects, which have a string
    property and I'm referencing like: elems[0].filename = "red"; but the
    above example is easier to follow.

    I would like to be able to perform this search as fast as possible. I
    personally don't have any real experience or familiarity with topics
    such as binary trees, etc; would something like that be the route to
    take? Thanks.

  • Philipp Schumann

    #2
    Re: Quickly finding duplicates in an ArrayList

    I know of two routes that you can follow.

    The first - look at the ArrayList.Index Of method, which is probably the
    quickest way to find a certain item in the underlying array. I believe there
    is an overload to specify at which index to start the search.

    Secondly, if you create the arraylist yourself (not some other 3rd-party
    code) you could use ArrayList.Conta ins() when populating the list to ensure
    the duplicates don't get into the list in the first place.

    My "two" euro cents. :)

    "paradox" <dev@demiurgein c.com> schrieb im Newsbeitrag
    news:1109177515 .616245.237420@ o13g2000cwo.goo glegroups.com.. .[color=blue]
    > Basically I have an ArrayList of strings; I need a fast way of
    > searching through and comparing the elements of the ArrayList and the
    > return an ArrayList of items that have 3 Duplicates.
    >
    > For example, if the ArrayList has the following elements:
    >
    > elems[0] = "blue"
    > elems[1] = "red"
    > elems[2] = "blue"
    > elems[3] = "green"
    > elems[4] = "red"
    > elems[5] = "red"
    >
    > I want to have a function that goes through the list and then returns a
    > new arraylist like:
    >
    > newElems[0] = "red"
    > newElems[1] = "red"
    > newElems[2] = "red"
    >
    > In actuality my arraylist is an array of objects, which have a string
    > property and I'm referencing like: elems[0].filename = "red"; but the
    > above example is easier to follow.
    >
    > I would like to be able to perform this search as fast as possible. I
    > personally don't have any real experience or familiarity with topics
    > such as binary trees, etc; would something like that be the route to
    > take? Thanks.
    >[/color]


    Comment

    • Mike McIntyre [MVP]

      #3
      Re: Quickly finding duplicates in an ArrayList

      Though it is written in VB.NET, the demo project at the link below is easy
      to follow and should be helpful.


      "paradox" <dev@demiurgein c.com> wrote in message
      news:1109177515 .616245.237420@ o13g2000cwo.goo glegroups.com.. .[color=blue]
      > Basically I have an ArrayList of strings; I need a fast way of
      > searching through and comparing the elements of the ArrayList and the
      > return an ArrayList of items that have 3 Duplicates.
      >
      > For example, if the ArrayList has the following elements:
      >
      > elems[0] = "blue"
      > elems[1] = "red"
      > elems[2] = "blue"
      > elems[3] = "green"
      > elems[4] = "red"
      > elems[5] = "red"
      >
      > I want to have a function that goes through the list and then returns a
      > new arraylist like:
      >
      > newElems[0] = "red"
      > newElems[1] = "red"
      > newElems[2] = "red"
      >
      > In actuality my arraylist is an array of objects, which have a string
      > property and I'm referencing like: elems[0].filename = "red"; but the
      > above example is easier to follow.
      >
      > I would like to be able to perform this search as fast as possible. I
      > personally don't have any real experience or familiarity with topics
      > such as binary trees, etc; would something like that be the route to
      > take? Thanks.
      >[/color]


      Comment

      • paradox

        #4
        Re: Quickly finding duplicates in an ArrayList

        Well... actually the point of the code is to find the duplicates and
        then display them to users. To give you a much better understanding,
        the program I am creating is for determining conflicts in the pk3 (zip)
        files used in Quake 3. As people develop new maps, they create and use
        existing map elements (textures, shaders, etc) and they are all stored
        in these pk3 files. The problem that occurs is that they create their
        own version of certain shaders or textures without renaming or changing
        the filename directory structure; and this causes the game to fail to
        load properly. Specifically the problem seems to be associated to
        occasions when there are 3 pk3 files with a shader element that has the
        same name but different file size.

        I've gone through the pk3 (zip) files and populated an arraylist called
        conlficts with a object of type element (a class I've defined).

        I'm trying to provide a report that will go thru this list and return
        these elements (and their associated pk3 files). So that the end user
        can easily determine which files that they need to remove in order to
        be able to connect successfully.

        Using IndexOf isn't much different than the slow 3 nested loop approach
        I'm trying to avoid.

        Comment

        • Kenny

          #5
          Re: Quickly finding duplicates in an ArrayList

          I would iterate through the ArrayList and for each item, add it to a
          HashTable if it does not already exist in the HashTable and set its
          value to 1. If it already exists, increment the value of that item in
          the HashTable.

          Once you've gone through the ArrayList, iterate through the HashTable
          to see which items have a value greater than 1.

          It's heavy on memory, but runs in linear time. Should be much better
          than 3 nested loops.

          Comment

          • Brian Gideon

            #6
            Re: Quickly finding duplicates in an ArrayList

            Walk through the array list and add each item to a hashtable that
            stores the number of occurrences of that item. Since hashtable inserts
            and searches have O(1) time complexity the entire operation would be
            O(n), which is good.

            Hashtable table = new Hashtable();
            foreach (string item in elems)
            {
            if (!table.Contain s(item))
            {
            table.Add(item, 0);
            }

            int count = (int)table[item] + 1;
            table[item] = count;
            if (count == 2)
            {
            // Duplicate items are identified here.
            }
            }

            Brian

            paradox wrote:[color=blue]
            > Basically I have an ArrayList of strings; I need a fast way of
            > searching through and comparing the elements of the ArrayList and the
            > return an ArrayList of items that have 3 Duplicates.
            >
            > For example, if the ArrayList has the following elements:
            >
            > elems[0] = "blue"
            > elems[1] = "red"
            > elems[2] = "blue"
            > elems[3] = "green"
            > elems[4] = "red"
            > elems[5] = "red"
            >
            > I want to have a function that goes through the list and then returns
            > a new arraylist like:
            >
            > newElems[0] = "red"
            > newElems[1] = "red"
            > newElems[2] = "red"
            >
            > In actuality my arraylist is an array of objects, which have a string
            > property and I'm referencing like: elems[0].filename = "red"; but the
            > above example is easier to follow.
            >
            > I would like to be able to perform this search as fast as possible. I
            > personally don't have any real experience or familiarity with topics
            > such as binary trees, etc; would something like that be the route to
            > take? Thanks.[/color]

            Comment

            • Jason Black [MSFT]

              #7
              Re: Quickly finding duplicates in an ArrayList

              This is almost what I was going to suggest, but with the difference that
              since you need to track the original ArrayList indexes of the duplicate
              items, you should create new ArrayLists and store them in the hashtable.
              Something like:

              Hashtable h = new Hashtable();
              for(i=0; i < MainArrayList.C ount; i++) {
              // get a key (name, whatever) in some way that's appropriate for your
              objects:
              someKeyType key = GetKeyFromObjec t(MainArrayList[i]);
              // make a new arraylist if necessary
              if(h[key] == null) {
              h[key] = new ArrayList();
              }
              // store the index of this item in the arraylist
              ((ArrayList)h[key]).Add(i);
              }
              foreach(someKey Type key in h.Keys) {
              if( ((ArrayList)h[key]).Count > 1) {
              // do whatever you need to do to show the duplicates to the user.
              // ((ArrayList)h[key]) contains the indexes of the duplicate items.
              }
              }

              Again, harsh on memory, but fast.


              "Kenny" <kvu@thereturne xchange.com> wrote in message
              news:1109181246 .839986.293450@ f14g2000cwb.goo glegroups.com.. .[color=blue]
              >I would iterate through the ArrayList and for each item, add it to a
              > HashTable if it does not already exist in the HashTable and set its
              > value to 1. If it already exists, increment the value of that item in
              > the HashTable.
              >
              > Once you've gone through the ArrayList, iterate through the HashTable
              > to see which items have a value greater than 1.
              >
              > It's heavy on memory, but runs in linear time. Should be much better
              > than 3 nested loops.
              >[/color]


              Comment

              • paradox

                #8
                Re: Quickly finding duplicates in an ArrayList

                Very nice Jason, that is exactly what I needed.

                Thank you all very much!

                Comment

                • paradox

                  #9
                  Re: Quickly finding duplicates in an ArrayList

                  Out of curiousity.. how exactly does a Hashtable work? Why is it so
                  fast?

                  Comment

                  • Brian Gideon

                    #10
                    Re: Quickly finding duplicates in an ArrayList

                    Google for it first. There's a lot of information available that does
                    a better job of explaining it than I could do. If there is still
                    something you don't quite understand go ahead and post back and we'll
                    be more than happy to clarify something.

                    Brian

                    paradox wrote:[color=blue]
                    > Out of curiousity.. how exactly does a Hashtable work? Why is it so
                    > fast?[/color]

                    Comment

                    • paradox

                      #11
                      Re: Quickly finding duplicates in an ArrayList

                      Yeah, I did google it immediately after posting the question and was
                      able to find sufficient information. If I could of deleted the post, I
                      would have. ;)

                      I do have a small question though. Does the .NET Hashtable work in
                      exactly the same manner as a general Hashtable? Also, I tried using
                      reflector to track down the GetHashCode function to see how it worked;
                      I ended up at the String class (taking into consideration the type of
                      key I was using was a string), and the code I found for the GetHashCode
                      function was:

                      [MethodImpl(Meth odImplOptions.I nternalCall)]
                      public override extern int GetHashCode();

                      I couldn't figure out where to find anymore info...

                      Comment

                      • Brian Gideon

                        #12
                        Re: Quickly finding duplicates in an ArrayList

                        The GetHashCode method is a virtual member of an Object so everything
                        has a default implementation for it. The string's version of the
                        method is internal to the .NET framework. That's why you can't see it.
                        One of the most frequently used ways of getting the hash code of an
                        object is to XOR all of the object's fields. In the case of a string
                        you might XOR all characters together. I seriously doubt that's what
                        the framework is doing though. XOR works well because it is fast and
                        produces a very random distribution of values. That's important when
                        it comes time to map the hash code to a slot in the memory array. A
                        good distribution of hash codes lowers the probability that two keys
                        will get mapped to the same slot in memory. This is called a
                        collision. There are many strategies for handling collisions. I
                        believe the Hashtable in dotnet uses quadratic probing. This strategy
                        involves number theory and the use of prime numbers.

                        paradox wrote:[color=blue]
                        > Yeah, I did google it immediately after posting the question and was
                        > able to find sufficient information. If I could of deleted the post,
                        > I would have. ;)
                        >
                        > I do have a small question though. Does the .NET Hashtable work in
                        > exactly the same manner as a general Hashtable? Also, I tried using
                        > reflector to track down the GetHashCode function to see how it
                        > worked; I ended up at the String class (taking into consideration
                        > the type of key I was using was a string), and the code I found for
                        > the GetHashCode function was:
                        >
                        > [MethodImpl(Meth odImplOptions.I nternalCall)]
                        > public override extern int GetHashCode();
                        >
                        > I couldn't figure out where to find anymore info...[/color]

                        Comment

                        Working...