Count all unique elements in an array.

  • Lisa L
    New Member
    • Jul 2011
    • 3

    Hello everyone,

    I am a biologist who is relatively new to programming. This isn't for homework; I would like to efficiently write a piece of a program that counts how many times each unique element appears in a sorted array. My array has over 200,000 elements, so it would be silly to brute force the whole thing, especially because the array is sorted.

    Example:
    @array = (0,0,0,1,1,1,1,2,3,3);
    Would return:
    0: 3
    1: 4
    2: 1
    3: 2


    Any suggestions? I don't need the entire code if it is against the rules, but I would like an idea of how to program this.
    Last edited by Lisa L; Jul 26 '11, 07:59 PM. Reason: clarification.
  • Rabbit
    Recognized Expert MVP
    • Jan 2007
    • 12517

    #2
    I'm not sure what you mean by brute force, but if it's sorted, you could just start counting and, when the value changes, insert the value and its count into a new array. In pseudocode, it would be
    Code:
    var arr1, arr2, group, count, i
    
    load arr1
    group = arr1(0)
    count = 0
    
    for i = 0 to upperbound(arr1)
       if arr1(i) != group
          insert into arr2 (group, count)
          count = 0
          group = arr1(i)
       end if
    
       count = count + 1
    end for
    
    insert into arr2 (group, count)
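    In Perl, a minimal sketch of that same single-pass idea might look like this (assuming a sorted @array of numbers and collecting [value, count] pairs; the sample data is just the example from the question):
    Code:
    use warnings;
    use strict;
    
    my @array = (0,0,0,1,1,1,1,2,3,3);   # sorted input
    my @counts;                          # [value, count] pairs
    
    my $group = $array[0];
    my $count = 0;
    for my $elem (@array) {
        if ($elem != $group) {           # run ended, record it
            push @counts, [$group, $count];
            $group = $elem;
            $count = 0;
        }
        $count++;
    }
    push @counts, [$group, $count];      # record the final run
    
    print "$_->[0]: $_->[1]\n" for @counts;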

    • Lisa L
      New Member
      • Jul 2011
      • 3

      #3
      Thanks! By brute force I mean nested loops that go through the whole array in order to count.

      Thanks for your post, appreciate it!
      -Lisa

      • Rabbit
        Recognized Expert MVP
        • Jan 2007
        • 12517

        #4
        Not a problem. Since it's sorted, you don't need nested loops but you will need the one loop to iterate through the whole array.

        • chaarmann
          Recognized Expert Contributor
          • Nov 2007
          • 785

          #5
          If you have a rough estimate of how often each number appears in the array on average, you don't need to step through the whole array one element at a time, because it's sorted. This can speed up your program considerably.

          Here is the idea:
          Let's say you know that each number is in the array 10 times on average.
          First look at the first element.
          Then look at the tenth element.
          Case 1: If it's the same, go forward by single steps until you reach a different element. (For example, if the thirteenth element is the first different one, you have used only 5 lookups for 13 elements, which is roughly 2.5 times faster than stepping one by one.)
          Case 2: If it's not the same, go backward by single steps until you find one that is the same. (For example, if the seventh element is the same, you have used only 5 lookups for 10 elements, which is 2 times faster than stepping one by one.)
          Then use the position just after the last matching element (13 in the first example, 8 in the second) as the new starting point and repeat the whole procedure.

          If you can't estimate the average in advance, you can use statistics from the first elements to guess the jump size for later elements as you go through the array. This works well if the different numbers are distributed roughly evenly in the array.

          If there is no such distribution, meaning that among 200,000 elements one value might appear 10,000 times and another only twice, you can use binary jump steps: jump by 2, 4, 8, 16, 32, 64 and so on. That means you look at the first, third, seventh, fifteenth etc. element. With this strategy you only need about 10 lookups to cover more than 1000 equal elements in a row.

          If you know in advance how many different elements there are in your array, or you know all the values that can occur, you can optimize even further.
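          A rough Perl sketch of the jump-ahead idea above (the $jump value of 10 is an assumed average run length you would tune for your own data, and the sample @array is made up for illustration):
          Code:
          use warnings;
          use strict;
          
          my @array = ((0) x 12, (1) x 8, (2) x 1, (3) x 9);   # sorted sample data
          my $jump  = 10;                                      # assumed average run length
          my @counts;
          
          my $start = 0;
          while ($start <= $#array) {
              my $val   = $array[$start];
              my $probe = $start + $jump - 1;
              $probe = $#array if $probe > $#array;
          
              my $end = $probe;
              if ($array[$probe] == $val) {
                  # Case 1: still inside the run, walk forward until it changes
                  $end++ while $end < $#array && $array[$end + 1] == $val;
              }
              else {
                  # Case 2: overshot the run, walk backward until we are back inside it
                  $end-- while $array[$end] != $val;
              }
              push @counts, [$val, $end - $start + 1];
              $start = $end + 1;   # the next run begins right after this one
          }
          
          print "$_->[0]: $_->[1]\n" for @counts;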

          • chaarmann
            Recognized Expert Contributor
            • Nov 2007
            • 785

            #6
            Just one more idea:
            Use an algorithm like binary search: always cut the jump size in half when going forward or backward instead of moving by single steps.
            For example, compare element 1 with element 1024. If they are not the same, compare element 1 with element 512. If those are not the same either, compare element 1 with element 256; if those are the same, compare element 1 with element 256 + 128 = 384, and so on.
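            Combining the doubling jumps from post #5 with this halving step gives an exponential-search style lookup for the end of each run. A Perl sketch along those lines (run_end is just an illustrative helper name, not anything from the posts above):
            Code:
            use warnings;
            use strict;
            
            # Return the index of the last element equal to $array->[$start],
            # assuming the array is sorted so equal values sit next to each other.
            sub run_end {
                my ($array, $start) = @_;
                my $val = $array->[$start];
            
                # Double the jump size until we step past the run (or hit the end).
                my $step = 1;
                my $hi   = $start;
                while ($hi < $#$array && $array->[$hi] == $val) {
                    $hi = $start + $step > $#$array ? $#$array : $start + $step;
                    $step *= 2;
                }
                return $hi if $array->[$hi] == $val;   # the run reaches the end of the array
            
                # Binary search between the last known match and the first known mismatch.
                my $lo = $start;
                while ($lo + 1 < $hi) {
                    my $mid = int(($lo + $hi) / 2);
                    if ($array->[$mid] == $val) { $lo = $mid } else { $hi = $mid }
                }
                return $lo;
            }
            
            my @array = (0,0,0,1,1,1,1,2,3,3);
            my $start = 0;
            while ($start <= $#array) {
                my $end = run_end(\@array, $start);
                printf "%d: %d\n", $array[$start], $end - $start + 1;
                $start = $end + 1;
            }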

            • toolic
              Recognized Expert New Member
              • Sep 2009
              • 70

              #7
              200k elements will not take up much memory, so just loop through the array, counting the values in a hash. This works whether the values are already sorted or not.

              Code:
              use warnings;
              use strict;
              use Data::Dumper;
              
              my @array = (0,0,0,1,1,1,1,2,3,3);
              my %hash;
              for (@array) {
                  $hash{$_}++;    # tally each value as it is seen
              }
              print Dumper(\%hash);
              
              __END__
              
              $VAR1 = {
                        '1' => 4,
                        '3' => 2,
                        '0' => 3,
                        '2' => 1
                      };
