Miner optimizations talk

Post Reply
Pttn
Posts: 143
Joined: 24 Aug 2018, 13:37

Miner optimizations talk

Post by Pttn » 11 Sep 2018, 09:32

Propose your suggestions or even code to improve the miner in this topic.
rieMiner - Riecoin solo + pooled miner
Personal Riecoin page (links, download,...)
freebitco.in - earn up to $200 in BTC each hour!

Pttn
Posts: 143
Joined: 24 Aug 2018, 13:37

Re: Miner optimizations talk

Post by Pttn » 11 Sep 2018, 09:43

IGJ wrote:
11 Sep 2018, 08:45
In theory it should be like you say, because the chance each from all 6 numbers to be prime is nearly same. But in practice you have little bit different result. Here is one example Ive made:

cand +4 +2 +4 +2 +4 ++ fail
7 11 13 17 19 23 OK block
17 [21] 23 27 29 (33) BAD 2 SAVE 0
37 41 43 47 [(49)] 53 BAD 5 SAVE 1
47 [51] 53 57 59 (63) BAD 2 SAVE 0
67 71 73 [(77)] 79 83 BAD 4 SAVE 0
97 101 103 107 109 113 OK block
107 [111] 113 117 119 (123) BAD 2 SAVE 0
127 131 [133] 137 139 (143) BAD 3 SAVE 1
137 [141] 143 147 149 (153) BAD 2 SAVE 0
157 [161] 163 167 (169) 173 BAD 2 SAVE -1
167 [171] 173 177 179 (183) BAD 2 SAVE 0
197 [201] 203 207 209 (213) BAD 2 SAVE 0
227 [231] 233 237 239 (243) BAD 2 SAVE 0
257 [261] 263 267 269 (273) BAD 2 SAVE 0
277 281 283 [287] (289) 293 BAD 4 SAVE 1
307 311 313 317 [319] (323) BAD 5 SAVE 3
317 321 [323] 327 329 (333) BAD 3 SAVE 1
337 [341] (343) 347 349 353 BAD 2 SAVE -3
347 [351] 353 357 359 (363) BAD 2 SAVE 0
367 [371] 373 (377) 379 383 BAD 2 SAVE -2
397 401 [403] 407 409 (413) BAD 2 SAVE 1
457 461 463 467 [469] (473) BAD 5 SAVE 3

---- Sum 77 checks , Saves 5 checks ----
[] <- where ++ cycle will break
() <- where -- cycle will break

PS: If you want start thread in developer corner where we can move to talk, I will move my posts, so can save this thread clear for official annonces.
How about testing much more sextuplets, like 1000000+? I am still not convinced. And not only more sextuplets, but very big sextuplets. You might get like a 0.1% save, and this will be the statistical error.

I myself wrote a program to compute all the average number of tests needed for all the 6! = 720 ways to test, and all of them need the same number of tests on average.

Code: Select all

Permutations : 720
Cases : 64
012345 - Average tests: 1.96875 <-- Test in the normal order
012354 - Average tests: 1.96875
012435 - Average tests: 1.96875
[...]
543120 - Average tests: 1.96875
543201 - Average tests: 1.96875
543210 - Average tests: 1.96875 <-- Test in the inverse order 
Gains will cancel out with losses; nothing will be improved. It must be pretty much the same in practise. But if you test millions of tuples and still get a few % save, I might reconsider my position and try in rieMiner.
rieMiner - Riecoin solo + pooled miner
Personal Riecoin page (links, download,...)
freebitco.in - earn up to $200 in BTC each hour!

IGJ
Posts: 46
Joined: 21 Aug 2018, 11:43

Re: Miner optimizations talk

Post by IGJ » 11 Sep 2018, 19:41

Here is fast, ugly, bad implementation of my idea. It is only for benchmark :)

--- of/rieMiner/miner.cpp 2018-09-11 22:25:56.230924554 +0300
+++ rieMiner/miner.cpp 2018-09-11 22:31:31.426912644 +0300
@@ -317,6 +317,7 @@
// fallthrough: job.type == TYPE_CHECK
if (job.type == TYPE_CHECK) {
for (uint64_t idx(0) ; idx < job.testWork.n_indexes ; idx++) {
+ stats.totalCandidatesCheck++;
uint8_t nPrimes(0);
mpz_set(z_temp, miner.primorial);
mpz_mul_ui(z_temp, z_temp, job.testWork.loop*riecoin_sieveSize);
@@ -335,6 +336,17 @@
continue;

nPrimes++;
+
+#ifdef IGJ
+ mpz_add_ui(z_temp, z_temp, 16);
+ mpz_sub_ui(z_ft_n, z_temp, 1);
+ mpz_powm(z_ft_r, z_ft_b, z_ft_n, z_temp);
+ if (mpz_cmp_ui(z_ft_r, 1) != 0)
+ continue;
+
+ mpz_sub_ui(z_temp, z_temp, 16);
+#endif
+
// Note start at 1 - we've already tested bias 0
for (int i(1) ; i < 6 ; i++) {
mpz_add_ui(z_temp, z_temp, primeTupleOffset);


--- of/rieMiner/global.h 2018-09-11 22:25:56.230924554 +0300
+++ rieMiner/global.h 2018-09-11 22:28:19.770919454 +0300
@@ -41,6 +41,8 @@
uint32_t difficulty, blockHeightAtDifficultyChange;
std::chrono::time_point<std::chrono::system_clock> start, startMining, lastDifficultyChange;
bool solo;
+
+ uint32_t totalCandidatesCheck=0;

Stats() {
for (uint8_t i(0) ; i < 7 ; i++) {
@@ -66,6 +68,14 @@
printTime();
std::cout << " (2/3t/s) = (" << FIXED(2) << foundTuplesSinceLastDifficulty[2]/elapsedSecs << " " << FIXED(3) << foundTuplesSinceLastDifficulty[3]/elapsedSecs << ") ; "
<< "(2-6t) = (" << foundTuples[2] << " " << foundTuples[3] << " " << foundTuples[4] << " " << foundTuples[5] << " " << foundTuples[6] << ")";
+
+
+ double elapsedSecs(timeSince(startMining));
+ uint32_t elapsedSecsInt(elapsedSecs);
+ float pps = (double)totalCandidatesCheck / (double)(elapsedSecsInt);
+
+ printf("\ncand/sec: %.2lf / total cand: %d / elapsed time %d seconds\n", pps, totalCandidatesCheck, elapsedSecsInt);
+
}
}

To activate the change compile with:
make "CXX=g++ -DIGJ"

On my I5-3570 with 3 threads I see around 5% more candidates checked per second.

Pttn
Posts: 143
Joined: 24 Aug 2018, 13:37

Re: Miner optimizations talk

Post by Pttn » 12 Sep 2018, 01:12

One reason that would give us an improvement is that the probability for each number to be prime is actually sligthly different, such that testing the ones with the lowest first would be useful. But the probability are the same, or have negligible differences. To be absolutely sure, I calculated the probability for each member to be prime for a very large number of candidates. After 1 hour, or 4.23 millions candidates tested, here are the probabilities to be prime (sieve 2^29, Difficulty 1600): 0.032284 0.032324 0.032158 0.032524 0.032228 0.032286

There is a 1.138% difference between the most frequent and the least one, so you might think "great, so by changing the order, we can gain at least 1 %!". But it will actually not be the case. With that data, we can choose to test the number that is the least likely to be prime first, and the one with the highest probability last, and compute how much tests we will need on average. We can also consider the worst case. The probability distribution for the "best" way to test is (number of tests/probability)

Code: Select all

1 0.967842
2 0.0311216
3 0.00100292
4 0.0000323785
5 0.00000104533
6 0.0000000349136
And the worst

Code: Select all

1 0.967476
2 0.0314726
3 0.00101736
4 0.0000328466
5 0.00000106048
6 0.0000000353153
They have respective expected values of 1.0332289 and 1.0336103 (which is the average number of tests to do). So, you would skip up to 0.0369% of all the checks by choosing the best order, and this is the statistical error. If I rerun the test, I would get another "best" order, and if I tend time to infinity, they will all have the same probability (or maybe with a difference of something like 2^(-1600) each)

Now, an interesting question would be to know whether if these probabilities are actually independent, and this would be the other and only one reason that we could improve the miner by changing the testing order: given that n is prime, will this make n + 16 less or more likely to be prime? Or any other combination. But I am pretty sure that they are really independent, and testing, at least for the case I just said (probability that n + 16 is prime given that n is prime or not), confirmed that in both cases, n + 16 will have the same probability (about 0.12945 for difficulty 400) to be prime, so they are independent. I am pretty sure that it is the same for other pairs. With this, we can conclude that there is no way that changing the testing order will improve anything.

You certainly did not enough testing, or some external factor influenced your results.

And even if we could gain a few %, just changing the sieve size would improve much more the performance. We would need to find something that gives us 20-30% improvement or more, not just 2-3%. Or find a way to do efficient GPU mining, but I would prefer that Riecoin stays CPU only (or CPU + GPU without any advantage for GPUs).
rieMiner - Riecoin solo + pooled miner
Personal Riecoin page (links, download,...)
freebitco.in - earn up to $200 in BTC each hour!

IGJ
Posts: 46
Joined: 21 Aug 2018, 11:43

Re: Miner optimizations talk

Post by IGJ » 12 Sep 2018, 11:15

You are absolutely right for prime probability in infinity. In infinity the probability six and second numbers to be prime will be same. But here we have factors that changing the picture... how sieve working (I don't know always when look code part for sieving cant understand what it do in details), we don't have infinity number lengths, maybe on numbers with length of 400 bits probability will be flat, on numbers with length 1400 to one end, with length 1600 to other end, so in final it will be exactly as you say - balanced.

Anyway when I'm using your miner in solo environment will implement different checking order :) I still can see there is little boost with that (not entirely reverse order but 1 - 6 - 2 - 5 -3 - 4 even better will be to write algorithm that dynamically change the pattern depending on statistic it gathering).

Did you study how sieve in the miner works ? If so can you explain me in details ? I know that every starting number of 6 tuple prime chain must over on 7 in decimal, and sieve gives only such numbers, but are these number every 7 over number (97,107,117,127,137,147...). Or it do and some very basic prime sieving, and gives only probably prime numbers that over on 7 ? when I played with DGA's miner spending time in sieving was not much, and I did not play with, also the code was very hard to read and understand what it do.

It is possible to achieve 20%-30% speed up of the miner, but it will break system portability (it will run only on specific architectures), and have to be binary. To achieve such drastic increase we need to rewrite some of the critical points on assembler and to use architecture specific instructions. Currently I'm using such miners based on DGA's source, and may release them to community, but will be hard to be used from non advanced users. Other way will be to find different algorithm on which we to find 6 tuples. Maybe if we can do better and fast sieving that give us more probably prime numbers to check with Fermat's theorem.

About GPU or GPU+CPU miner... Last year I made a lot of tests, wrote a lot of code, even on nvidia assembler, in tries to implement Fermat check for GPU. All of them was only lost time :). GPU + big numbers + division = very slow :) Most likely I'm wrong, because if this address RLGzidMRkhfqEWjUY8fVxmgLS7HCdogXYN is not some of the pools or using massive bot net, maybe mining with GPU. XPM have gpu mining implementation, but XPM algorithm working with much smaller numbers, In best days riecoin algorithm working with 1700+ bits numbers which drastically increase needed limbs and calculation operations. After all my unsuccessful attempts to create fast Fermat check for numbers 1300+ bits on GPU, nearly lost my motivation, but maybe is possible to create strong prime sieving on gpu and then to check with cpu, the potential problem here will be transfer time of candidates from gpu memory to system memory, this may kill all advantage we will have doing operations in GPU.

Pttn
Posts: 143
Joined: 24 Aug 2018, 13:37

Re: Miner optimizations talk

Post by Pttn » 12 Sep 2018, 11:57

IGJ wrote:
12 Sep 2018, 11:15
Did you study how sieve in the miner works ? If so can you explain me in details ? I know that every starting number of 6 tuple prime chain must over on 7 in decimal, and sieve gives only such numbers, but are these number every 7 over number (97,107,117,127,137,147...). Or it do and some very basic prime sieving, and gives only probably prime numbers that over on 7 ? when I played with DGA's miner spending time in sieving was not much, and I did not play with, also the code was very hard to read and understand what it do.
Read this, it was written by dga, the programmer of fastrie. In particular, this is much better than just testing numbers finishing with 7.

Now, in the code, it seems that they did some further optimizations, and I do not have any clue what they did: what the hell are all theses variables entriesPerSegment, denseLimit, segmentCounts, n_dense, n_sparse, segment_hits??? If dga could explain these to us, I might consider to try to optimize the miner a bit.

I would really like that this project stays CPU only, and hate if ASIC mining becomes one day possible. But it seems that the higher the difficulty is, the harder GPU/specialized mining will be to program. Would you or someone be able to program an efficient GPU mining algorithm for difficulties of 2500 and more when Riecoin will be much more known? And when quantum computing will be mainstream, I wonder if the proof of work system would be broken as quantum algorithms seem to break many prime number based things.

I also wonder who RLGzid..., RSoT..., and another solo miner with random addresses are... How egoistic it would be to improve so much the algorithm and not share it to everyone. But if they did, someone else will as well, and they will eventually lose their advantage. That said, it is currently fairly easy to have a nice part of the network. Using a few Bi/Quad 32 Cores AMD Epyc would be enough to give you a few % of the network "hashrate". The huge miners might simply be server owners using unused resources of their computers. Even with only a 8 Cores, I am getting like 0.1-0.2% of the mining power.
rieMiner - Riecoin solo + pooled miner
Personal Riecoin page (links, download,...)
freebitco.in - earn up to $200 in BTC each hour!

IGJ
Posts: 46
Joined: 21 Aug 2018, 11:43

Re: Miner optimizations talk

Post by IGJ » 13 Sep 2018, 10:51

Thank you for the link, and thanks to DGA for the good explanation.

I got the idea what sieve must do and miss 2/3 of my last night sleep to testing and experimenting :). I still cant understand the sieve code in DGA's miner, nor why we need so much memory for sieving, nor why there is so much code, when all of this can be done with few not very big arrays and one cycle :). Also somehow current implemented sieve algorithm gives me some strange result. Anyway I will try to write entirely new sieve and to test, but first must understand how to create the starting point. Do you know how this is done ? I mean how initial starting number, from where you must start to check, is creating and how the last number is defined. Is it come as arguments in miner.cpp or it is generated in miner.cpp from some subroutine ?

If there is good documentation and there is proper algorithm for the hardware it will be possible. But there are and limitation based on current hardware developed level. 64bit architecture means that your computing unit can compute 64bit long in one instruction (for example your modern cpu can compute 64bit number + 64bit number with only one instruction and return the 64bit number + carry if there is such). To can compute longer numbers you need to split this number in limbs so to sum two 128bits number you will have to have two arrays of 2 64bit numbers and do summing of the arrays and handle the carries. If Riecoin hardness is 2500 that means minimal length of the prime number we search will be 2500 bits and thats 40 64bit limbs which will lead to minimum 80 instructions to be executed only to summing 2 such numbers (something useless for us). Obviously for prime number finder most used operation is division imagine how many instructions you have to execute to divide two arrays with 40 elements. There is and other problem here, division operations cant be made in parallel on different cores, because next step in division depending on previous, so you cant get much advantage from multicores computing unit as GPU and ASIC. Now there comes and the third problem it is in electronic design of the computing unit. To realize division instruction in computing unit you need a huge number of transistors which leads to more power consumption, more time this instruction to be executed and bigger silicon crystal (more defects during manufacture). Even some architectures don't have instruction for division and it is realize by software with summing or multiplying (compiler do this for you). GPU in practice is ASIC , it get computational boost from many cores that the chip have. Also modern GPUs have cores for 32bit and cores for 64bit instructions and 64bit cores are much less than 32bit cores as number (as I know you cant use 32bit cores and 64bit cores in same time), so splitting big numbers to 32bit limbs can allow you use more cores, but may kill your boost because they will need to do more operations. If you do operations like "and" "xor" "or" e.t.c. which can be done in parallel can get big speed up in your algorithm but for division I don't see how this can happen. When someday we have 1024bit computes, or somebody develope algorithm for division that can be calculate in parallel from many cores, or somebody manufacture ASIC that can do division very fast, yes Riecoin difficulties of 2500 will be like difficulties 300 now :).

To be honest right now we don't need very fast public miner, because production of many new coins, may totally kill the project. We need more advertising, and to make riecoin more popular, to involve more people in it. I think we should create services to make all riecoins we have by now usable somehow. When the time come maybe somebody will release much better miner or algorithm if not we will create such and will release :).

IGJ
Posts: 46
Joined: 21 Aug 2018, 11:43

Re: Miner optimizations talk

Post by IGJ » 05 Oct 2018, 12:14

Ahoy,

About the problem with 6+ threads and rieMiner. I do some tests (by the way you can simulate it on machines with less than 6 cores when skip 1/2 fermat checks) and think the problem comes from this function:
void Miner::process(WorkData block)

It cant supply fast enough work to workers threads. Here how I believe the things work... in main the miner spawn user_defined + 1 threads, all user defined threads become workers, the one extra thread become master it do some calculation, getting workdata and put jobs to worker threads. A possible solution to fix the problem, is eventually part of these calculations to be moved in worker threads but then the part with sieving must be rewrite. I think all this threading logic is wrong and need total rewrite, because it also produces invalid shares. To make things right every thread must have own allocating memory and to do own sieving and checking, also thread/memory locking/unlocking on busy threads must be down to minimum. This will give and some performance boost too.

here is something useful:
https://software.intel.com/en-us/articl ... ng-threads

Unfortunately i am heavy busy with my full time job and dont have much time to write but will put this in my task list if you not fix it meantime :)

Pttn
Posts: 143
Joined: 24 Aug 2018, 13:37

Re: Miner optimizations talk

Post by Pttn » 05 Oct 2018, 21:55

Thank you for this hypothesis and its explanations. Unfortunately, I am myself very busy, so I cannot get enough focus to try to fix this myself. This is the point of my bounty, to encourage someone else to write the fix, with a bigger one during a limited time lapse to also encourage to fix this more quickly.

I am also not the writer of that part of the miner, so if it is bad, it is not my fault :p. I essentially just ported it to more proper C++.

Let's extend the bounty deadline to November 1 2018, 00:00 Zurich timezone. I also increase it to 2000 RIC.

I will offer 5000 RIC if someone not only fix the CPU underuse bug, but also rewrites/refactors completely the mining code (mainly miner.h and .cpp) such that the problems pointed out by IGJ are fixed. The Memsets and any other C part (raw arrays to Std::Arrays or Vectors, ...) must be replaced by pure and modern C++ counterparts. It would also be nice if we could get rid of the tsqueue source file at the same time, if possible. And commented code is welcomed as well.

The proposed code must use proper/pure C++ (no C things, we are in 2018), not use any other library (Boost,..., but if you need C++14 or 17, it is Ok), and be very maintainable and elegant. The style should match the rest (using tabulations, CamelCase names,...). It must at least compile in Linux and Windows (MSYS2) x64 with G++ without particular effort, using at most a few "IfDefs". You must also accept to release your fix in the MIT license to receive the reward.

I will evaluate and test the code, and to receive the bounty, another active developer (IGJ or clo1) must agree that the fix is good enough (not needed if this is one of you who submit the code). If you are a candidate and unsure about the conditions to earn the bounty, you can always ask. I promise to pay if a fix are satisfactory. But only to the first submitted one if multiple acceptable fixes are submitted, although developers can collaborate and share the bounty.

Remember that 5000 RIC was worth hundreds of $ before the crash, and could be much more in the future.

After the deadline, the reward will be reduced to 500 RIC (or 1250 for the complete rewriting) until January 1 2019 00:00 Zurich Timezone, or as soon as the new wallet is officially approved. So please someone try to fix this before November.
rieMiner - Riecoin solo + pooled miner
Personal Riecoin page (links, download,...)
freebitco.in - earn up to $200 in BTC each hour!

IGJ
Posts: 46
Joined: 21 Aug 2018, 11:43

Re: Miner optimizations talk

Post by IGJ » 06 Oct 2018, 13:45

I looked the code in Miner::process(WorkData block) again and have some thoughts why the original author make it this. The idea behind maybe is to sync all worker threads to do same job (sieving / checking) in same time. In theory this could help CPU cache to store more instructions that are needed. I dont know how effective is this, but definitely as many thread you have as many of them will wait others to over their job to can sync and start the new job at same time. Also OS thread scheduler can interrupt thread/process work in any time and give time window other thread/process to run too. This will create extra desynchronization and will add more time worker threads to sleep and do nothing. The idea is not bad at all, and maybe could be implemented without need threads to be locked and sleep as they are now. Here is useful information how CPU cache is organized in intel's coffee lake chips:
https://en.wikichip.org/wiki/intel/micr ... _Hierarchy
Also to avoid memory locks when in/out data from worker thread is needed, atomic operation should be used (keep in mind on 32bit architecture, 64bit operation is not atomic)

Post Reply