https://timodenk.com/blog/program-arduino-in-assembly-or-c-cpp/

====== One second delay loop ======
 

<file asmdelay.asm>

main:
  sbi 0x04, 5       ; PORTB5 output
loop:               ; main loop begin
  sbi 0x05, 5       ; PORTB5 high
  call delay_1000ms ; delay 1s
  cbi 0x05, 5       ; PORTB5 low
  call delay_1000ms ; delay 1s
  rjmp  loop        ; main loop

delay_1000ms:       ; subroutine for 1s delay
                    ; initialize counters
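  ldi r18, 0xFF     ; low byte    (255)
  ldi r24, 0xD3     ; middle byte (211)
  ldi r25, 0x30     ; high byte   (48), r25:r24:r18 = 0x30D3FF
inner_loop:
  subi  r18, 0x01   ; subtract 1 from the low byte
  sbci  r24, 0x00   ; propagate the borrow into the middle byte
  sbci  r25, 0x00   ; propagate the borrow into the high byte
  brne  inner_loop  ; repeat until all three bytes are zero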
  ret
  
  </file>
  
The main program is relatively easy to understand, but how the delay_1000ms subroutine works is probably less obvious (thanks to Ido Gendel for investigating it). With a 16 MHz clock, a one-second delay function essentially has to keep the CPU busy for 16 million clock cycles.

The first three ldi instructions (“Loads an 8-bit constant directly to register 16 to 31.”) take one cycle each, three in total. Together they load the 24-bit value 0x30D3FF (3,199,999) into r25:r24:r18.
subi “subtracts a register and a constant, and places the result in the destination register Rd.” The constant is 0x01 in the snippet, and the instruction takes one clock cycle.
Each of the following two sbci instructions “subtracts a constant from a register and subtracts with the C Flag, and places the result in the destination register Rd”. The constant is 0x00 in both cases, so the only thing subtracted is the C Flag, which is set only if the previous subtraction resulted in an underflow (0 - 1 -> 255). Each sbci takes one cycle. Taken together, the three instructions decrement r25:r24:r18 as a single 24-bit counter. An sbci also leaves the Zero Flag set only if its own result is zero and the flag was already set, so after the second sbci the flag is set only when all three registers are zero.
Finally, the brne instruction “[…] tests the Zero Flag (Z) and branches relatively to PC if Z is cleared.” This takes two cycles if it branches, otherwise one.

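Going through the instructions step by step, each pass through the loop costs 5 cycles (1 + 1 + 1 + 2), except the last one, where brne falls through and takes one cycle instead of two. The loop runs once per counter value, that is 0x30D3FF = 48 × 65,536 + 211 × 256 + 255 = 3,199,999 times. That leads to a total number n of clock cycles for the subroutine body (everything before ret):

n = 3 + 3,199,999 × 5 - 1

= 15,999,997

On a device with a 16-bit program counter such as the ATmega328P, call and ret add 4 cycles each, so one call of delay_1000ms takes 16,000,005 cycles. That is one second to within about a third of a microsecond.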
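The cycle count can be checked off-target with a short Python sketch. It is only a model of the loop, not code from the original post: the function names are made up for illustration, and the 4-cycle costs for call and ret are an assumption that holds for devices with a 16-bit program counter such as the ATmega328P. Counting one pass per value of the 24-bit counter, the closed form gives 15,999,997 cycles for the subroutine body, and the flag-level simulation confirms the formula on a small counter.

```python
def delay_body_cycles(low, mid, high):
    """Cycles from the first ldi up to, but not including, ret."""
    passes = (high << 16) | (mid << 8) | low   # loop runs once per counter value
    if passes == 0:
        passes = 1 << 24                       # a zero counter wraps all the way around
    ldi = 3                                    # three 1-cycle ldi instructions
    per_pass = 1 + 1 + 1 + 2                   # subi + sbci + sbci + taken brne
    return ldi + passes * per_pass - 1         # final brne falls through: 1 cycle less


def simulate(low, mid, high):
    """Step the loop with explicit carry and zero flags and count cycles."""
    r18, r24, r25 = low, mid, high
    cycles = 3                                 # the three ldi instructions
    while True:
        carry = 1 if r18 < 1 else 0            # subi r18, 0x01
        r18 = (r18 - 1) & 0xFF
        zero = r18 == 0
        borrow = carry                         # sbci r24, 0x00
        carry = 1 if r24 < borrow else 0
        r24 = (r24 - borrow) & 0xFF
        zero = zero and r24 == 0               # Z survives only if it was already set
        borrow = carry                         # sbci r25, 0x00
        r25 = (r25 - borrow) & 0xFF
        zero = zero and r25 == 0
        cycles += 3
        if zero:
            return cycles + 1                  # brne not taken: 1 cycle
        cycles += 2                            # brne taken: 2 cycles


CALL_CYCLES = 4   # assumption: 16-bit program counter (e.g. ATmega328P)
RET_CYCLES = 4    # assumption: 16-bit program counter (e.g. ATmega328P)

body = delay_body_cycles(0xFF, 0xD3, 0x30)
total = body + CALL_CYCLES + RET_CYCLES
print(body, total)
```

The simulation is only run on small counters here, since stepping through all 3.2 million passes in Python is slow and adds nothing once the closed form has been checked.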


===== One second delay using a timer =====


<file timer_delay.asm>

.set PINB =     0x03
.set DDRB =     0x04
.set TCCR0B =   0x25
.set TCNT0 =    0x26
.set LED_MASK = 0b00100000
.set PS_1024 =  0b00000101

setup:
    ldi r16, PS_1024    ; Set r16 with prescaler 1024 value
    out TCCR0B, r16     ; Select the clk/1024 prescaler in TCCR0B
    ldi r16, LED_MASK   ; Set r16 to the LED bit
    out DDRB, r16       ; Set LED pin to output
    clr r18             ; Clear the saved timer
loop:
    ldi r20, 61         ; Initialize our software counter
check_timer:
    in r17, TCNT0       ; Read the timer
    cp r17, r18         ; Compare with previous value
    mov r18, r17        ; Save current value
    brsh check_timer    ; unless the timer has decreased, repeat
decrement:
    dec r20             ; decrement the software counter
    brne check_timer    ; if not zero, go back to checking the timer
toggle:
    out PINB, r16       ; toggle the LED
    rjmp loop

</file>
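The software counter value 61 in the listing above follows from the timer arithmetic. A quick check in Python, assuming the 16 MHz clock mentioned earlier and an 8-bit Timer0 that wraps every 256 ticks (the variable names are only for illustration):

```python
F_CPU = 16_000_000    # assumed CPU clock in Hz (16 MHz)
PRESCALER = 1024      # selected by PS_1024 in TCCR0B
TIMER_STEPS = 256     # TCNT0 is 8 bits wide, so it wraps every 256 ticks
SOFTWARE_COUNT = 61   # value loaded into r20

# How often the timer wraps per second, and how long 61 wraps take.
overflows_per_second = F_CPU / (PRESCALER * TIMER_STEPS)
toggle_period_s = SOFTWARE_COUNT * PRESCALER * TIMER_STEPS / F_CPU

print(overflows_per_second, toggle_period_s)
```

The timer wraps about 61.04 times per second, so counting 61 wraps gives a toggle period of roughly 0.9994 s, slightly under one second.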


===== Counting cycles for delay =====


<file asm loop369cycles>

    ldi r16,123 ;1
    DELAY:
    dec r16 ;1
    brne DELAY ; 1/2
</file>

That's all there is to it. In time-sensitive code, where you are trying to determine the cycle count, it helps to note the cycle count next to each instruction.

How this delay works...

- ldi r16,123 : loads the register r16 with a value; this takes one cycle.

- DELAY: labels the address to jump back to during the countdown loop.

- dec r16 : subtracts one from the register r16; this takes one cycle.

- brne DELAY : a conditional branch that either jumps back to DELAY if r16 is not zero or continues to your next line of code.

brne is counted as 1/2 because it takes 2 cycles to branch back to DELAY, but only one cycle to continue. The final pass through the loop is therefore one cycle shorter than the others. That missing cycle is made up by the first instruction (ldi r16,123), which also takes one cycle.

So...

The value of r16 is decremented down to zero in 123 passes. Each pass through the DELAY loop eats up 3 cycles, except the last, which takes 2. The loop itself therefore takes 123 × 3 - 1 = 368 cycles, and the initial loading of r16 adds one more.

Total time is 369 cycles exactly, or 3 cycles per count.

To add more time to the loop, either increase the value of r16 or add NOPs (one cycle each) inside the loop, then multiply the new cycles per pass by the count to get the new total.

Here, I delay for 1405 cycles...


<file asm loop1405cycles>

    ldi r16,234 ;1
    DELAY:
    nop ;1
    nop ;1
    nop ;1
    dec r16 ;1
    brne DELAY ; 1/2
    nop ;1
</file>

Count the cycles, and you will see that the delay takes exactly 1405 cycles: 1 for ldi, 234 × 6 - 1 = 1403 for the loop, and 1 for the trailing nop. No more, no less.

Don't forget that if you call this as a subroutine, you have to add the cycles for the call and the return as well.
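The counting rule for these countdown loops can be written as a small Python helper. This is a sketch for checking the arithmetic off-target. The function name and parameters are invented for illustration, and the 4-cycle costs for call and ret are an assumption that holds for devices with a 16-bit program counter such as the ATmega328P. Counting the final, not-taken brne as one cycle, the two loops above come to 369 and 1405 cycles.

```python
def countdown_cycles(count, nops_in_loop=0, nops_after=0):
    """Cycles for: ldi r16,count / loop: <nops> dec brne loop / <nops after>."""
    passes = count if count else 256       # a count of 0 wraps around to 256 passes
    per_pass = nops_in_loop + 1 + 2        # nops + dec + taken brne
    loop = passes * per_pass - 1           # last brne is not taken: 1 cycle, not 2
    return 1 + loop + nops_after           # 1 cycle for the initial ldi


CALL_CYCLES = 4   # assumption: 16-bit program counter (e.g. ATmega328P)
RET_CYCLES = 4    # assumption: 16-bit program counter (e.g. ATmega328P)

short_delay = countdown_cycles(123)                                # first listing
long_delay = countdown_cycles(234, nops_in_loop=3, nops_after=1)   # second listing

# If the second loop were wrapped in a subroutine ending in ret:
long_delay_as_subroutine = long_delay + CALL_CYCLES + RET_CYCLES

print(short_delay, long_delay, long_delay_as_subroutine)
```

Because the not-taken brne and the ldi cancel out, a loop with no trailing instructions always costs a whole number of passes: the count times the cycles per pass.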

Most embedded control processors, including AVR, have interrupts running in the background, such as a 1000 Hz or 100 Hz timer interrupt. This has two consequences:

A) You can use a counter maintained in the interrupt handler for delays.

B) Any cycle-counted assembly delay will likely be interrupted briefly by the timer in (A), and so will run longer than calculated.

So delay loops based on CPU cycles are usually not a good idea.
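To put a rough number on (B), the following Python sketch estimates how much a periodic interrupt stretches a cycle-counted delay. The interrupt cost of 50 cycles per occurrence (entry, handler and return together) is a made-up figure for illustration. The real cost depends on the handler.

```python
def stretched_delay_s(loop_cycles, f_cpu, irq_rate_hz, isr_cycles):
    """Wall-clock time of a cycle-counted loop while a periodic ISR steals cycles."""
    # Cycles per second that remain for the delay loop after the ISR has run.
    cycles_left_per_second = f_cpu - irq_rate_hz * isr_cycles
    return loop_cycles / cycles_left_per_second


F_CPU = 16_000_000     # 16 MHz clock
ISR_CYCLES = 50        # hypothetical cost of one timer interrupt, in cycles

# A nominal one-second busy loop with a 1000 Hz timer interrupt running.
actual_s = stretched_delay_s(16_000_000, F_CPU, 1000, ISR_CYCLES)
error_ms = (actual_s - 1.0) * 1000

print(actual_s, error_ms)
```

Under these assumptions the nominal one-second delay runs about 3 ms long, which is far more than the few cycles gained by careful cycle counting.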