https://timodenk.com/blog/program-arduino-in-assembly-or-c-cpp/

====== One second delay loop ======

  main:
      sbi   0x04, 5        ; PORTB5 output
  loop:                    ; main loop begin
      sbi   0x05, 5        ; PORTB5 high
      call  delay_1000ms   ; delay 1s
      cbi   0x05, 5        ; PORTB5 low
      call  delay_1000ms   ; delay 1s
      rjmp  loop           ; main loop
  delay_1000ms:            ; subroutine for 1s delay
      ; initialize counters
      ldi   r18, 0xFF      ; 255
      ldi   r24, 0xD3      ; 211
      ldi   r25, 0x30      ; 48
  inner_loop:
      subi  r18, 0x01      ; 1
      sbci  r24, 0x00      ; 0
      sbci  r25, 0x00      ; 0
      brne  inner_loop
      ret

While the main program is relatively easy to understand, it is probably not obvious how the delay_1000ms function works (thanks to Ido Gendel for investigating that). On a 16 MHz clock, the task of a one-second delay function is essentially to keep the CPU busy for 16 million clock cycles.

The first three ldi (“Loads an 8-bit constant directly to register 16 to 31.”) instructions take one cycle each, three cycles in total. subi “subtracts a register and a constant, and places the result in the destination register Rd.” The constant is 0x01 in the snippet, and the execution time is again one clock cycle. Each of the following two sbci instructions “subtracts a constant from a register and subtracts with the C Flag, and places the result in the destination register Rd”, also in one cycle each. The constant is 0x00 in both cases, so the only thing that is subtracted is the C Flag. The C Flag, however, is only set if the previous subtraction resulted in an underflow (0 - 1 -> 255). Together, r25:r24:r18 therefore behave as one 24-bit counter that is decremented by one on each pass.

Finally, the brne instruction “[…] tests the Zero Flag (Z) and branches relatively to PC if Z is cleared.” This takes two cycles if it branches, otherwise one. Because sbci only leaves Z set when its own result is zero and Z was already set, the branch falls through only once all three bytes have reached zero.
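The flag behaviour described above can be checked with a small model. The following Python sketch is not part of the original program: the function name delay_loop_cycles is made up for illustration, and it assumes the usual AVR costs (ldi, subi, sbci: 1 cycle; brne: 2 cycles when taken, 1 when not). It steps through the subi/sbci/sbci/brne loop byte by byte and counts passes and cycles.

```python
def delay_loop_cycles(r18, r24, r25):
    """Model the delay loop: 3x ldi, then subi/sbci/sbci/brne until Z is set.

    Returns (iterations, cycles). Illustrative model only; call and ret
    are not included.
    """
    cycles = 3          # three ldi instructions, one cycle each
    iterations = 0
    while True:
        # subi r18, 0x01: C is set on underflow, Z reflects the result
        carry = 1 if r18 < 1 else 0
        r18 = (r18 - 1) & 0xFF
        zero = (r18 == 0)

        # sbci r24, 0x00: subtracts only the carry; Z stays set only if
        # it was set before AND this result is zero
        next_carry = 1 if r24 < carry else 0
        r24 = (r24 - carry) & 0xFF
        zero = zero and (r24 == 0)
        carry = next_carry

        # sbci r25, 0x00: same rule for the top byte
        r25 = (r25 - carry) & 0xFF
        zero = zero and (r25 == 0)

        cycles += 3     # subi + sbci + sbci
        iterations += 1
        if zero:
            cycles += 1  # brne not taken: fall through to ret
            return iterations, cycles
        cycles += 2      # brne taken: back to inner_loop


# Small counter 0x000100 (r25=0x00, r24=0x01, r18=0x00)
print(delay_loop_cycles(0x00, 0x01, 0x00))  # (256, 1282)
```

In this model the number of passes always equals the initial 24-bit value, so the cycle count is 3 + 5 x value - 1. For the constants used above (r25:r24:r18 = 0x30D3FF = 3,199,999) that gives 15,999,997 cycles; running the model with those values takes a few seconds in Python.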
Going through the statements step by step: the three counter registers form the 24-bit value 0x30D3FF = 255 + 256 x 211 + 256 x 256 x 48 = 3,199,999, and the loop runs once per count. Each pass takes 5 cycles, except the last one, where brne falls through and saves one cycle. That leads to a total number of n clock cycles for the subroutine body:

  n = 3 + (255 + 256 x 211 + 256 x 256 x 48) x 5 - 1 = 15,999,997

which is 16 million to within a few cycles.

===== One second delay using a timer =====

  .set PINB     = 0x03
  .set DDRB     = 0x04
  .set TCCR0B   = 0x25
  .set TCNT0    = 0x26
  .set LED_MASK = 0b00100000
  .set PS_1024  = 0b00000101
  setup:
      ldi   r16, PS_1024    ; Set r16 with prescaler 1024 value
      out   TCCR0B, r16     ; Select the 1024 prescaler in TCCR0B
      ldi   r16, LED_MASK   ; Set r16 to the LED bit
      out   DDRB, r16       ; Set LED pin to output
      clr   r18             ; Clear the saved timer
  loop:
      ldi   r20, 61         ; Initialize our software counter
  check_timer:
      in    r17, TCNT0      ; Read the timer
      cp    r17, r18        ; Compare with previous value
      mov   r18, r17        ; Save current value (mov leaves the flags alone)
      brsh  check_timer     ; unless the timer has decreased (wrapped), repeat
  decrement:
      dec   r20             ; decrement the software counter
      brne  check_timer     ; if not zero, go back to checking the timer
  toggle:
      out   PINB, r16       ; toggle the LED
      rjmp  loop

===== Counting cycles for delay =====

      ldi   r16, 123   ; 1
  DELAY:
      dec   r16        ; 1
      brne  DELAY      ; 1/2

That's all there is to it. In time-sensitive code, where you are trying to determine the cycle count, it helps to write the cycle count next to each instruction.

How this delay works:

  - ldi r16,123: loads a register (r16) with a value; this takes one cycle.
  - DELAY: sets the address to jump back to during the delay countdown loop.
  - dec r16: subtracts one from the register r16, and takes one cycle.
  - brne DELAY: a conditional branch that either branches back to DELAY if r16 is not zero or continues to your next line of code. brne is counted as 1/2 because it takes 2 cycles to branch back to DELAY, but only one cycle to continue.

Because of this, the final pass through the loop is one cycle shorter than the others, and you have to offset your time by one cycle to obtain an exactly known cycle count through the delay routine. But this is taken care of by the first instruction (ldi r16,123). So...
The value of r16 is decremented down to zero. Each pass through the DELAY loop eats up 3 cycles, except the last one, which takes only 2 because brne does not branch. The loop therefore takes 3 x 123 - 1 = 368 cycles, plus the initial loading of r16 that took one more cycle. Total time is 369 cycles exactly.

To add more time to the loop, either increase the value of r16 or add NOPs (single cycle) inside the loop and multiply to get the new value. Here, I delay for 1405 cycles:

      ldi   r16, 234   ; 1
  DELAY:
      nop              ; 1
      nop              ; 1
      nop              ; 1
      dec   r16        ; 1
      brne  DELAY      ; 1/2
      nop              ; 1

Count the cycles, and you will see that the delay takes exactly 1405 cycles: 1 for ldi, 6 x 234 - 1 = 1403 for the loop, and 1 for the trailing nop. No more, no less. Don't forget that if you are calling this routine, you have to add the cycle time for getting there and returning as well.

Most embedded control processors, including AVR, have interrupts occurring, such as a 1000 Hz or 100 Hz timer interrupt. This means that (A) you can use the counter in the interrupt handler for delays, and (B) any assembly-language delay loop will likely be interrupted briefly by the timer in (A). So delay loops based on CPU cycles are usually not a good idea.
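The cycle counting for the dec/brne loops in this section can be written as a small formula. The Python sketch below is only an illustration: the function name counted_loop_cycles and its parameters are made up here, and it assumes the per-instruction costs listed above (ldi, dec, nop: 1 cycle; brne: 2 cycles when it branches, 1 when it falls through) with no interrupts firing during the loop.

```python
def counted_loop_cycles(count, body_cycles=0, tail_cycles=0):
    """Cycles for: ldi rN,count / LOOP: <body> / dec rN / brne LOOP / <tail>.

    count       -- value loaded by ldi (0..255); 0 wraps and gives 256 passes
    body_cycles -- extra cycles inside the loop (e.g. 3 for three nops)
    tail_cycles -- extra cycles after the loop (e.g. 1 for one nop)
    """
    if not 0 <= count <= 255:
        raise ValueError("count must fit in an 8-bit register")
    passes = count if count != 0 else 256
    per_pass = body_cycles + 1 + 2   # body + dec + brne (taken)
    loop = passes * per_pass - 1     # last brne falls through: one cycle less
    return 1 + loop + tail_cycles    # 1 cycle for the initial ldi


print(counted_loop_cycles(123))                                # 369
print(counted_loop_cycles(234, body_cycles=3, tail_cycles=1))  # 1405
```

The cost of reaching the routine and returning from it (for example call and ret) is not included and has to be added separately.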