Hi there.
First of, this is only for the fun of it, and I have no illusions that it will ever lead anywhere. I simply want to learn and have fun while doing it.
So, I've written a small program, which calculates all primes, using a regular home brewed sieve, within the range of 32 bits, the idea being that nothing more is needed to test values within the 64 bit range.
Every odd number from 1 - 2^32 - 1 is represented in a bitset. True if prime and otherwise false.
The data however takes up 256 mb of memory. This admittedly is hard to do anything about, but compressing and storing the data on disk is of course another matter entirely.
I've been thinking long and hard about the most efficient method of doing so, and have literally spend hours searching for common patterns and all that.
In the end though, only one approach seems to make sense. The thing is that the data contains a whole lots of long strings of zeros (that is bits not characters), all in all taking up around 231 mb of space, in some 230 million intervals.
This has to be hugely compressible, but I simply know way too little about the subject and don't really know where to start.
I'm currently considering some form of compound Hoffman coding, but I'm not entirely sure it's the right approach, and not sure at all how to implement it. I have however generated a list of stream of consecutive zeros and how often they occur, and I think it makes sense.
When all that is said, I'm mostly just interested in general advice. Have I missed any appropriate compression algorithms, do you have any ideas or advice, and so on and so forth.
Best regards.
First of, this is only for the fun of it, and I have no illusions that it will ever lead anywhere. I simply want to learn and have fun while doing it.
So, I've written a small program, which calculates all primes, using a regular home brewed sieve, within the range of 32 bits, the idea being that nothing more is needed to test values within the 64 bit range.
Every odd number from 1 - 2^32 - 1 is represented in a bitset. True if prime and otherwise false.
The data however takes up 256 mb of memory. This admittedly is hard to do anything about, but compressing and storing the data on disk is of course another matter entirely.
I've been thinking long and hard about the most efficient method of doing so, and have literally spend hours searching for common patterns and all that.
In the end though, only one approach seems to make sense. The thing is that the data contains a whole lots of long strings of zeros (that is bits not characters), all in all taking up around 231 mb of space, in some 230 million intervals.
This has to be hugely compressible, but I simply know way too little about the subject and don't really know where to start.
I'm currently considering some form of compound Hoffman coding, but I'm not entirely sure it's the right approach, and not sure at all how to implement it. I have however generated a list of stream of consecutive zeros and how often they occur, and I think it makes sense.
Code:
Number of consecutive zeros and number of occurrences: 2: 22851920| 3: 10204618| 4: 13227107| 5: 17178814 6: 9537739| 7: 7203001| 8: 13186171| 9: 7297439 10: 6257057| 11: 9566572| 12: 4581564| 13: 4994988 14: 9078006| 15: 2892527| 16: 3044808| 17: 5013791 18: 2373033| 19: 2822346| 20: 4222840| 21: 1740263 22: 1509880| 23: 2624423| 24: 1504728| 25: 1140228 26: 1916025| 27: 999131| 28: 862089| 29: 1795026 30: 569130| 31: 580761| 32: 1076487| 33: 443027 34: 635301| 35: 651041| 36: 322809| 37: 289021 38: 554641| 39: 287223| 40: 203909| 41: 425884 42: 154057| 43: 160181| 44: 325950| 45: 110541 46: 101330| 47: 179534| 48: 94932| 49: 99012 50: 132684| 51: 61061| 52: 53934| 53: 93571 54: 56929| 55: 44838| 56: 69166| 57: 29504 58: 29007| 59: 62466| 60: 19971| 61: 20185 62: 38812| 63: 14024| 64: 19315| 65: 24588 66: 10598| 67: 9393| 68: 18000| 69: 11242 70: 6786| 71: 11614| 72: 4993| 73: 5492 74: 11316| 75: 3611| 76: 4497| 77: 6341 78: 2622| 79: 3059| 80: 4086| 81: 1961 82: 1732| 83: 3871| 84: 1856| 85: 1232 86: 2207| 87: 1027| 88: 905| 89: 2006 90: 784| 91: 686| 92: 1070| 93: 425 94: 667| 95: 739| 96: 354| 97: 382 98: 648| 99: 315|100: 232|101: 413 102: 173|103: 170|104: 419|105: 124 106: 99|107: 188|108: 81|109: 114 110: 110|111: 66|112: 56|113: 87 114: 56|115: 32|116: 89|117: 32 118: 34|119: 56|120: 22|121: 18 122: 28|123: 17|124: 15|125: 25 126: 6|127: 10|128: 16|129: 8 130: 8|131: 10|132: 10|133: 5 134: 10|135: 2|136: 2|137: 7 138: 1|139: 4|140: 6|141: 3 142: 2|143: 5|144: 5|145: 3 151: 1|152: 1|154: 1|159: 1 167: 2| | |
Best regards.
Comment