P2 Docs

PROPELLER 2 MEMORY
------------------

In the Propeller 2, there are two primary types of memory:

HUB MEMORY

128K bytes of main memory shared by all cogs

- cogs launch from this memory
- cogs can access this memory as bytes, words, longs, and quads (4 longs)
- $00000..$00E7F is ROM - contains Booter, SHA-256/HMAC, and Monitor
- $00E80..$1FFFF is RAM - for application usage

COG MEMORY (8 instances)

512 longs of register RAM for code and data usage

- simultaneous instruction, source, and destination reading, plus writing
- last eight registers are for I/O pin control

256 longs of stack RAM for data and video usage

- accessible via push and pop operations
- video circuit can read data simultaneously and asynchronously

INSTRUCTION ENCODING
--------------------

Cog instructions are 32 bits long and are composed of several bit fields. There are two main types of
instructions: dual-operand and single-operand. Dual-operand instructions specify both a D register, which
is usually read and written back, and an S register, which is read or used as an immediate value.
Single-operand instructions specify only a D register.

Dual-operand encoding:

TTTTTT ZCR I CCCC DDDDDDDDD SSSSSSSSS IF_x MNEM D,S/#n WZ,WC,NR

TTTTTT = Opcode bits that select the instruction (MNEM)
I = SSSSSSSSS register or immediate, 0=register address (S), 1=immediate (#n)

Single-operand encoding:

000011 ZCR 1 CCCC DDDDDDDDD TTTTTTTTT IF_x MNEM D WZ,WC,NR

TTTTTTTTT = Opcode bits that select the instruction (MNEM)

For both cases:

Z = Z flag write control: 0=don't write Z, 1=write Z
Defaults to 0, but may be set to 1 by adding WZ (Write Z) after operand(s)

Unless specified otherwise, the value written to Z is the NOR of the 32 bits of the D result (Z = 1 when the result is zero).

C = C flag write control: 0=don't write C, 1=write C
Defaults to 0, but may be set to 1 by adding WC (Write C) after operand(s)

R = D register write control: 0=don't write D, 1=write D
Default varies by instruction, but may be cleared to 0 by adding NR (No Result)

CCCC = Execution condition (expressed by IF_x mnemonic prefix)
Determines Z/C flag conditions upon which the instruction will execute

CCCC  condition        CCCC  mnemonic prefixes (in easy-to-read order)
---------------------------------------------------------------------
0000  never            1111  IF_ALWAYS (default)
0001  nc & nz          1100  IF_C                           IF_B
0010  nc & z           0011  IF_NC                          IF_AE
0011  nc               1010  IF_Z                           IF_E
0100  c & nz           0101  IF_NZ                          IF_NE
0101  nz               1000  IF_C_AND_Z     IF_Z_AND_C
0110  c <> z           0100  IF_C_AND_NZ    IF_NZ_AND_C
0111  nc | nz          0010  IF_NC_AND_Z    IF_Z_AND_NC
1000  c & z            0001  IF_NC_AND_NZ   IF_NZ_AND_NC    IF_A
1001  c = z            1110  IF_C_OR_Z      IF_Z_OR_C       IF_BE
1010  z                1101  IF_C_OR_NZ     IF_NZ_OR_C
1011  nc | z           1011  IF_NC_OR_Z     IF_Z_OR_NC
1100  c                0111  IF_NC_OR_NZ    IF_NZ_OR_NC
1101  c | nz           1001  IF_C_EQ_Z      IF_Z_EQ_C
1110  c | z            0110  IF_C_NE_Z      IF_Z_NE_C
1111  always           0000  IF_NEVER

DDDDDDDDD = Destination register address (D)

SSSSSSSSS = Source register address (S) or zero-extended immediate value (#n)
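
The field layout above can be modeled in a few lines of Python. This is a minimal sketch, not a tool from the Propeller 2 toolchain: the helper names (`parse_bits`, `decode_fields`, `condition_met`) are hypothetical, and the bit positions are inferred only from the field order and widths shown in the dual-operand encoding. The condition check assumes that CCCC can be read as a 4-entry truth table indexed by (C, Z), which agrees with every row of the condition table.

```python
def parse_bits(text):
    """Turn a spaced bit pattern such as '000000 101 0 ...' into an integer."""
    return int(text.replace(" ", ""), 2)


def decode_fields(word):
    """Split a 32-bit dual-operand instruction word into its bit fields."""
    return {
        "T": (word >> 26) & 0x3F,     # TTTTTT, 6-bit opcode
        "Z": (word >> 25) & 1,        # write Z flag
        "C": (word >> 24) & 1,        # write C flag
        "R": (word >> 23) & 1,        # write D register
        "I": (word >> 22) & 1,        # 0 = S is a register, 1 = S is immediate
        "CCCC": (word >> 18) & 0xF,   # execution condition
        "D": (word >> 9) & 0x1FF,     # destination register address
        "S": word & 0x1FF,            # source register address or immediate
    }


def condition_met(cccc, c, z):
    """Assumption: bit (2*c + z) of CCCC says whether the instruction executes."""
    return bool((cccc >> ((c << 1) | z)) & 1)


# RDBYTE D,S pattern from the hub memory table, with Z=1, CCCC=IF_ALWAYS, D=$001, S=$002
word = parse_bits("000000 101 0 1111 000000001 000000010")
fields = decode_fields(word)
print(fields)
print(condition_met(0b1100, c=1, z=0))  # IF_C with C set
```

Reading CCCC as a truth table keeps the check to one shift and one mask; a lookup of the sixteen named conditions would work equally well.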

HUB MEMORY INSTRUCTIONS
-----------------------

These instructions read and write hub memory.

All instructions use D as the data conduit, except WRQUAD/RDQUAD/RDQUADC, which use the four QUAD
registers. The QUADs can be mapped into cog register space using the SETQUAD instruction or kept
hidden, in which case they are still useful as a data conduit and as a read cache. If mapped, the QUADs
overlay four contiguous cog registers. These overlaid registers can be read and written like any other
registers, and can also be executed. Any write via D to the QUAD registers, when mapped, will also affect
the underlying cog registers. A RDQUAD/RDQUADC will affect the QUAD registers, but not the
underlying cog registers.

The cached reads RDBYTEC/RDWORDC/RDLONGC/RDQUADC will do a RDQUAD if the current read address is
outside of the 4-long window of the prior RDQUAD. Otherwise, they will immediately return cached
data. The CACHEX instruction invalidates the cache, forcing a fresh RDQUAD next time a cached read
executes.
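
The hit/miss rule for cached reads can be sketched as a toy model. The class and method names are hypothetical, and the model assumes that the "4-long window" is the 16-byte-aligned quad selected by address bits 16..4, as the quad addressing rules suggest. It tracks only which window is cached, not the data.

```python
class QuadCache:
    """Toy model of the read cache used by RDBYTEC/RDWORDC/RDLONGC/RDQUADC."""

    def __init__(self):
        self.window = None  # index of the cached quad; None means invalid

    def cached_read(self, address):
        """Return True on a cache hit, False when a fresh RDQUAD is needed."""
        window = (address & 0x1FFFF) >> 4  # assumption: window = address bits 16..4
        if window == self.window:
            return True
        self.window = window  # the miss performs a RDQUAD, refilling the cache
        return False

    def cachex(self):
        """CACHEX: invalidate the cache so the next cached read misses."""
        self.window = None


cache = QuadCache()
hits = [cache.cached_read(a) for a in (0x100, 0x104, 0x10F, 0x110)]
cache.cachex()
after_cachex = cache.cached_read(0x110)
print(hits, after_cachex)
```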

Hub memory instructions must wait for their cog's hub cycle, which comes once every 8 clocks. The
timing relationship between a cog's instruction stream and its hub cycle is generally indeterminate,
causing these instructions to take varying numbers of clocks. Timing can be made deterministic, though,
by intentionally spacing these instructions apart so that after the first in a series executes, the
subsequent hub memory instructions fall on hub cycles, making them take the minimum number of
clocks. The trick is to write useful code to go in between them.

WRBYTE/WRWORD/WRLONG/WRQUAD/RDQUAD complete on the hub cycle, making them take 1..8 clocks.

RDBYTE/RDWORD/RDLONG complete on the 2nd clock after the hub cycle, making them take 3..10 clocks.

RDBYTEC/RDWORDC/RDLONGC take only 1 clock if data is cached, otherwise 3..10 clocks.

RDQUADC takes only 1 clock if data is cached, otherwise 1..8 clocks.
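
The clock counts above follow from how long an instruction waits for its hub cycle. The sketch below is an illustrative model only: `hub_clocks` and `cached_read_clocks` are hypothetical names, and `wait` (0..7) is an assumed way of expressing how many clocks pass before the cog's hub cycle arrives.

```python
HUB_PERIOD = 8  # a cog's hub cycle comes once every 8 clocks


def hub_clocks(wait, kind):
    """Clocks taken by a hub memory instruction.

    wait: 0..7 clocks spent waiting for the hub cycle (0 = already aligned).
    kind: 'on_cycle' for WRBYTE/WRWORD/WRLONG/WRQUAD/RDQUAD,
          'read' for RDBYTE/RDWORD/RDLONG.
    """
    if not 0 <= wait < HUB_PERIOD:
        raise ValueError("wait must be 0..7")
    if kind == "on_cycle":
        return wait + 1      # completes on the hub cycle: 1..8
    if kind == "read":
        return wait + 3      # completes 2 clocks after the hub cycle: 3..10
    raise ValueError("unknown kind")


def cached_read_clocks(hit, wait, quad=False):
    """RDxxxxC timing: 1 clock on a hit, otherwise the uncached timing."""
    if hit:
        return 1
    return hub_clocks(wait, "on_cycle" if quad else "read")


print([hub_clocks(w, "on_cycle") for w in range(8)])
print([hub_clocks(w, "read") for w in range(8)])
```

In this model, a series of hub instructions spaced so that each one arrives with `wait == 0` always takes the minimum count, which is the deterministic case described above.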

After a RDQUAD, mapped QUAD registers are accessible via D and S after three clocks:

RDQUAD hubaddress 'read a quad into the QUAD registers mapped at quad0..quad3

NOP 'do something for at least 3 clocks to allow QUADs to update
NOP
NOP

CMP quad0,quad1 'mapped QUADs are now accessible via D and S

After a RDQUAD, mapped QUAD registers are executable after three clocks and one instruction:

SETQUAD #quad0 'map QUADs to quad0..quad3

RDQUAD hubaddress 'read a quad into the QUAD registers mapped at quad0..quad3

NOP 'do something for at least 3 clocks to allow QUADs to update
NOP
NOP

NOP 'do at least 1 instruction to get QUADs into pipeline

quad0 NOP 'QUAD0..QUAD3 are now executable
quad1 NOP
quad2 NOP
quad3 NOP

After a SETQUAD, mapped QUAD registers are writable immediately, but their original contents are
readable via D and S only after 2 instructions:

SETQUAD #quad0 'map QUADs to quad0..quad3 (new address)

NOP 'do at least two instructions to queue up QUADs
NOP

CMP quad0,quad1 'mapped QUADS are now accessible via D and S

On cog startup, the QUAD registers are cleared to 0's.

instructions clocks
---------------------------------------------------------------------------------------------------------
000000 000 0 CCCC DDDDDDDDD SSSSSSSSS WRBYTE D,S 'write lower byte in D at S 1..8
000000 000 1 CCCC DDDDDDDDD SUPNNNNNN WRBYTE D,PTR 'write lower byte in D at PTR 1..8
000000 Z01 0 CCCC DDDDDDDDD SSSSSSSSS RDBYTE D,S 'read byte at S into D 3..10
000000 Z01 1 CCCC DDDDDDDDD SUPNNNNNN RDBYTE D,PTR 'read byte at PTR into D 3..10
000000 Z11 0 CCCC DDDDDDDDD SSSSSSSSS RDBYTEC D,S 'read cached byte at S into D 1, 3..10
000000 Z11 1 CCCC DDDDDDDDD SUPNNNNNN RDBYTEC D,PTR 'read cached byte at PTR into D 1, 3..10

000001 000 0 CCCC DDDDDDDDD SSSSSSSSS WRWORD D,S 'write lower word in D at S 1..8
000001 000 1 CCCC DDDDDDDDD SUPNNNNNN WRWORD D,PTR 'write lower word in D at PTR 1..8
000001 Z01 0 CCCC DDDDDDDDD SSSSSSSSS RDWORD D,S 'read word at S into D 3..10
000001 Z01 1 CCCC DDDDDDDDD SUPNNNNNN RDWORD D,PTR 'read word at PTR into D 3..10
000001 Z11 0 CCCC DDDDDDDDD SSSSSSSSS RDWORDC D,S 'read cached word at S into D 1, 3..10
000001 Z11 1 CCCC DDDDDDDDD SUPNNNNNN RDWORDC D,PTR 'read cached word at PTR into D 1, 3..10

000010 000 0 CCCC DDDDDDDDD SSSSSSSSS WRLONG D,S 'write D at S 1..8
000010 000 1 CCCC DDDDDDDDD SUPNNNNNN WRLONG D,PTR 'write D at PTR 1..8
000010 Z01 0 CCCC DDDDDDDDD SSSSSSSSS RDLONG D,S 'read long at S into D 3..10
000010 Z01 1 CCCC DDDDDDDDD SUPNNNNNN RDLONG D,PTR 'read long at PTR into D 3..10
000010 Z11 0 CCCC DDDDDDDDD SSSSSSSSS RDLONGC D,S 'read cached long at S into D 1, 3..10
000010 Z11 1 CCCC DDDDDDDDD SUPNNNNNN RDLONGC D,PTR 'read cached long at PTR into D 1, 3..10

000011 000 1 CCCC DDDDDDDDD 010110000 WRQUAD D 'write QUADs at D 1..8
000011 001 1 CCCC SUPNNNNNN 010110000 WRQUAD PTR 'write QUADs at PTR 1..8
000011 000 1 CCCC DDDDDDDDD 010110001 RDQUAD D 'read quad at D into QUADs 1..8
000011 001 1 CCCC SUPNNNNNN 010110001 RDQUAD PTR 'read quad at PTR into QUADs 1..8
000011 010 1 CCCC DDDDDDDDD 010110001 RDQUADC D 'read cached quad at D into QUADs 1, 1..8
000011 011 1 CCCC SUPNNNNNN 010110001 RDQUADC PTR 'read cached quad at PTR into QUADs 1, 1..8
---------------------------------------------------------------------------------------------------------

PTR expressions:

INDEX = -32..+31 for simple offsets, 0..31 for ++'s, or 0..32 for --'s
SCALE = 1 for byte, 2 for word, 4 for long, or 16 for quad

S = 0 for PTRA, 1 for PTRB
U = 0 to keep PTRx same, 1 to update PTRx
P = 0 to use PTRx + INDEX*SCALE, 1 to use PTRx (post-modify)
NNNNNN = INDEX
nnnnnn = -INDEX

SUPNNNNNN PTR expression
-----------------------------------------------------------------------------
000000000 PTRA 'use PTRA
100000000 PTRB 'use PTRB
011000001 PTRA++ 'use PTRA, PTRA += SCALE
111000001 PTRB++ 'use PTRB, PTRB += SCALE
011111111 PTRA-- 'use PTRA, PTRA -= SCALE
111111111 PTRB-- 'use PTRB, PTRB -= SCALE
010000001 ++PTRA 'use PTRA + SCALE, PTRA += SCALE
110000001 ++PTRB 'use PTRB + SCALE, PTRB += SCALE
010111111 --PTRA 'use PTRA - SCALE, PTRA -= SCALE
110111111 --PTRB 'use PTRB - SCALE, PTRB -= SCALE

000NNNNNN PTRA[INDEX] 'use PTRA + INDEX*SCALE
100NNNNNN PTRB[INDEX] 'use PTRB + INDEX*SCALE
011NNNNNN PTRA++[INDEX] 'use PTRA, PTRA += INDEX*SCALE
111NNNNNN PTRB++[INDEX] 'use PTRB, PTRB += INDEX*SCALE
011nnnnnn PTRA--[INDEX] 'use PTRA, PTRA -= INDEX*SCALE
111nnnnnn PTRB--[INDEX] 'use PTRB, PTRB -= INDEX*SCALE
010NNNNNN ++PTRA[INDEX] 'use PTRA + INDEX*SCALE, PTRA += INDEX*SCALE
110NNNNNN ++PTRB[INDEX] 'use PTRB + INDEX*SCALE, PTRB += INDEX*SCALE
010nnnnnn --PTRA[INDEX] 'use PTRA - INDEX*SCALE, PTRA -= INDEX*SCALE
110nnnnnn --PTRB[INDEX] 'use PTRB - INDEX*SCALE, PTRB -= INDEX*SCALE

Examples:

000000 Z01 1 CCCC DDDDDDDDD 000000000 RDBYTE D,PTRA 'read byte at PTRA into D
000001 000 1 CCCC DDDDDDDDD 111000001 WRWORD D,PTRB++ 'write lower word in D at PTRB, PTRB += 2
000010 Z01 1 CCCC DDDDDDDDD 011111111 RDLONG D,PTRA-- 'read long at PTRA into D, PTRA -= 4
000011 001 1 CCCC 110000001 010110001 RDQUAD ++PTRB 'read quad at PTRB+16 into QUADs, PTRB += 16
000000 000 1 CCCC DDDDDDDDD 010111111 WRBYTE D,--PTRA 'write lower byte in D at PTRA-1, PTRA -= 1

000001 000 1 CCCC DDDDDDDDD 100000111 WRWORD D,PTRB[7] 'write lower word in D to PTRB+7*2
000010 Z11 1 CCCC DDDDDDDDD 011001111 RDLONGC D,PTRA++[15] 'read cached long at PTRA into D, PTRA += 15*4
000011 001 1 CCCC 111111101 010110000 WRQUAD PTRB--[3] 'write QUADs at PTRB, PTRB -= 3*16
000000 000 1 CCCC DDDDDDDDD 010000110 WRBYTE D,++PTRA[6] 'write lower byte in D to PTRA+6*1, PTRA += 6*1
000001 Z01 1 CCCC DDDDDDDDD 110110110 RDWORD D,--PTRB[10] 'read word at PTRB-10*2 into D, PTRB -= 10*2
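
The SUPNNNNNN patterns in these examples can be reproduced with a small encoder. This is a sketch under stated assumptions: `encode_ptr` and its `mode` strings are hypothetical names chosen for illustration, and the default INDEX of 1 for the bare `++`/`--` forms is inferred from the table of PTR expressions.

```python
def encode_ptr(reg, mode="", index=None):
    """Return the 9-bit SUPNNNNNN field for a PTR expression.

    reg:   'A' for PTRA, 'B' for PTRB
    mode:  '' (plain or offset), 'post++', 'post--', 'pre++', 'pre--'
    index: INDEX; defaults to 0 for the plain form and 1 for ++/-- forms
    """
    s = {"A": 0, "B": 1}[reg]
    if mode == "":
        index = 0 if index is None else index
        if not -32 <= index <= 31:
            raise ValueError("simple offsets must be -32..+31")
        u, p, n = 0, 0, index
    elif mode in ("post++", "post--", "pre++", "pre--"):
        index = 1 if index is None else index
        minus = mode.endswith("--")
        if not 0 <= index <= (32 if minus else 31):
            raise ValueError("INDEX out of range for this form")
        u = 1                                  # update PTRx
        p = 1 if mode.startswith("post") else 0
        n = -index if minus else index         # nnnnnn = -INDEX for -- forms
    else:
        raise ValueError("unknown mode")
    return (s << 8) | (u << 7) | (p << 6) | (n & 0x3F)


print(format(encode_ptr("B", "post--", 3), "09b"))   # WRQUAD PTRB--[3]
print(format(encode_ptr("A", "pre++", 6), "09b"))    # WRBYTE D,++PTRA[6]
```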

Bytes, words, longs, and quads are addressed as follows:

for WRBYTE/RDBYTE/RDBYTEC, address = %XXXXXXXXXXXXXXXXX (bits 16..0 are used)
for WRWORD/RDWORD/RDWORDC, address = %XXXXXXXXXXXXXXXX- (bits 16..1 are used)
for WRLONG/RDLONG/RDLONGC, address = %XXXXXXXXXXXXXXX-- (bits 16..2 are used)
for WRQUAD/RDQUAD/RDQUADC, address = %XXXXXXXXXXXXX---- (bits 16..4 are used)

address byte word long quad
-------------------------------------------------------------------
00000- 50 *7250 *706F7250 *0C7CCC030C7C200020302E32706F7250
00001- 72 7250 706F7250 0C7CCC030C7C200020302E32706F7250
00002- 6F *706F 706F7250 0C7CCC030C7C200020302E32706F7250
00003- 70 706F 706F7250 0C7CCC030C7C200020302E32706F7250
00004- 32 *2E32 *20302E32 0C7CCC030C7C200020302E32706F7250
00005- 2E 2E32 20302E32 0C7CCC030C7C200020302E32706F7250
00006- 30 *2030 20302E32 0C7CCC030C7C200020302E32706F7250
00007- 20 2030 20302E32 0C7CCC030C7C200020302E32706F7250
00008- 00 *2000 *0C7C2000 0C7CCC030C7C200020302E32706F7250
00009- 20 2000 0C7C2000 0C7CCC030C7C200020302E32706F7250
0000A- 7C *0C7C 0C7C2000 0C7CCC030C7C200020302E32706F7250
0000B- 0C 0C7C 0C7C2000 0C7CCC030C7C200020302E32706F7250
0000C- 03 *CC03 *0C7CCC03 0C7CCC030C7C200020302E32706F7250
0000D- CC CC03 0C7CCC03 0C7CCC030C7C200020302E32706F7250
0000E- 7C *0C7C 0C7CCC03 0C7CCC030C7C200020302E32706F7250
0000F- 0C 0C7C 0C7CCC03 0C7CCC030C7C200020302E32706F7250
00010- 45 *FE45 *0DC1FE45 *0D7CC6010C7CC6010CFCB6E30DC1FE45
00011- FE FE45 0DC1FE45 0D7CC6010C7CC6010CFCB6E30DC1FE45
00012- C1 *0DC1 0DC1FE45 0D7CC6010C7CC6010CFCB6E30DC1FE45
00013- 0D 0DC1 0DC1FE45 0D7CC6010C7CC6010CFCB6E30DC1FE45
00014- E3 *B6E3 *0CFCB6E3 0D7CC6010C7CC6010CFCB6E30DC1FE45
00015- B6 B6E3 0CFCB6E3 0D7CC6010C7CC6010CFCB6E30DC1FE45
00016- FC *0CFC 0CFCB6E3 0D7CC6010C7CC6010CFCB6E30DC1FE45
00017- 0C 0CFC 0CFCB6E3 0D7CC6010C7CC6010CFCB6E30DC1FE45
00018- 01 *C601 *0C7CC601 0D7CC6010C7CC6010CFCB6E30DC1FE45
00019- C6 C601 0C7CC601 0D7CC6010C7CC6010CFCB6E30DC1FE45
0001A- 7C *0C7C 0C7CC601 0D7CC6010C7CC6010CFCB6E30DC1FE45
0001B- 0C 0C7C 0C7CC601 0D7CC6010C7CC6010CFCB6E30DC1FE45
0001C- 01 *C601 *0D7CC601 0D7CC6010C7CC6010CFCB6E30DC1FE45
0001D- C6 C601 0D7CC601 0D7CC6010C7CC6010CFCB6E30DC1FE45
0001E- 7C *0D7C 0D7CC601 0D7CC6010C7CC6010CFCB6E30DC1FE45
0001F- 0D 0D7C 0D7CC601 0D7CC6010C7CC6010CFCB6E30DC1FE45

* new word/long/quad
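
The table can be reproduced by masking off the unused low address bits and reading the data least-significant byte first. The sketch below is illustrative only: `hub_read` is a hypothetical helper, and `HUB` holds just the sixteen bytes listed for $00000..$0000F.

```python
# Bytes at $00000..$0000F, copied from the byte column of the table
HUB = bytes.fromhex("50726F70322E3020" "00207C0C03CC7C0C")

SIZES = {"byte": 1, "word": 2, "long": 4, "quad": 16}


def hub_read(memory, address, size):
    """Read a byte/word/long/quad, ignoring the low address bits as the hub does."""
    n = SIZES[size]
    base = address & ~(n - 1)   # word drops bit 0, long bits 1..0, quad bits 3..0
    return int.from_bytes(memory[base:base + n], "little")


for addr in (0x3, 0x5, 0xB):
    print(f"{addr:05X}: word={hub_read(HUB, addr, 'word'):04X} "
          f"long={hub_read(HUB, addr, 'long'):08X}")
```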

PTRA/PTRB INSTRUCTIONS
----------------------

Each cog has two 17-bit pointers, PTRA and PTRB, which can be read, written, modified,
and used to access hub memory.

At cog startup, the PTRA and PTRB registers are initialized as follows:

PTRA = %X_XXXXXXXX_XXXXXXXX, data from launching cog, usually a pointer
PTRB = %X_XXXXXXXX_XXXXXX00, long address in hub where cog code was loaded from

instructions clocks
-------------------------------------------------------------------------------------------------
000011 ZCR 1 CCCC DDDDDDDDD 000010010 GETPTRA D 'get PTRA into D, C = PTRA[16] 1
000011 ZCR 1 CCCC DDDDDDDDD 000010011 GETPTRB D 'get PTRB into D, C = PTRB[16] 1

000011 000 1 CCCC DDDDDDDDD 010110010 SETPTRA D 'set PTRA to D 1
000011 001 1 CCCC nnnnnnnnn 010110010 SETPTRA #n 'set PTRA to 0..511 1
000011 000 1 CCCC DDDDDDDDD 010110011 SETPTRB D 'set PTRB to D 1
000011 001 1 CCCC nnnnnnnnn 010110011 SETPTRB #n 'set PTRB to 0..511 1

000011 000 1 CCCC DDDDDDDDD 010110100 ADDPTRA D 'add D into PTRA 1
000011 001 1 CCCC nnnnnnnnn 010110100 ADDPTRA #n 'add 0..511 into PTRA 1
000011 000 1 CCCC DDDDDDDDD 010110101 ADDPTRB D 'add D into PTRB 1
000011 001 1 CCCC nnnnnnnnn 010110101 ADDPTRB #n 'add 0..511 into PTRB 1

000011 000 1 CCCC DDDDDDDDD 010110110 SUBPTRA D 'subtract D from PTRA 1
000011 001 1 CCCC nnnnnnnnn 010110110 SUBPTRA #n 'subtract 0..511 from PTRA 1
000011 000 1 CCCC DDDDDDDDD 010110111 SUBPTRB D 'subtract D from PTRB 1
000011 001 1 CCCC nnnnnnnnn 010110111 SUBPTRB #n 'subtract 0..511 from PTRB 1
-------------------------------------------------------------------------------------------------

QUAD-RELATED INSTRUCTIONS
-------------------------

Each cog has four QUAD registers which form a 128-bit conduit between the hub memory and the cog.
This conduit can transfer four longs every 8 clocks via the WRQUAD/RDQUAD instructions. It can
also be used as a 4-long/8-word/16-byte read cache, utilized by RDBYTEC/RDWORDC/RDLONGC/RDQUADC.

Initially hidden, these QUAD registers are mappable into cog register space by using the SETQUAD
instruction to set an address where the base register is to appear, with the other three registers
following. To hide the QUAD registers, use SETQUAD to set an address of $1FF. SETQUAZ works just
like SETQUAD, but also clears the four QUAD registers.

instructions clocks
-------------------------------------------------------------------------------------------------
000011 000 1 CCCC 000000000 000001000 CACHEX 'invalidate cache 1
000011 Z01 1 CCCC DDDDDDDDD 000010001 GETTOPS D 'get top bytes of QUADs into D 1
000011 000 1 CCCC DDDDDDDDD 011100010 SETQUAD D 'set QUAD base to D 1
000011 001 1 CCCC nnnnnnnnn 011100010 SETQUAD #n 'set QUAD base to 0..511 1
000011 010 1 CCCC DDDDDDDDD 011100010 SETQUAZ D 'set QUAD base to D, QUAD=0 1
000011 011 1 CCCC nnnnnnnnn 011100010 SETQUAZ #n 'set QUAD base to 0..511, QUAD=0 1
-------------------------------------------------------------------------------------------------

HUB CONTROL INSTRUCTIONS
------------------------

These instructions are used to control hub circuits and cogs.

Hub instructions must wait for their cog's hub cycle, which comes once every 8 clocks. In cases where
there is no result to wait for (ZCR = %000), these instructions complete on the hub cycle, making
them take 1..8 clocks, depending on where the hub cycle is in relation to the instruction. In cases
where a result is anticipated (ZCR <> %000), these instructions complete on the 1st clock after the
hub cycle, making them take 2..9 clocks.

COGINIT D,S
-----------

COGINIT is used to start cogs. Any cog can be (re)started, whether it is idle or running. A cog
can even execute a COGINIT to restart itself with a new program.

COGINIT uses D to specify a long address in hub memory that is the start of the program that is to be
loaded into a cog, while S is a 17-bit parameter (usually an address) that will be conveyed to PTRA
of the started cog. PTRB of the started cog will be set to the start address of its program that was
loaded from hub memory.

SETCOG must be executed before COGINIT to set the number of the cog to be started (0..7). If SETCOG
sets a value with bit 3 set (%1xxx), this will cause the next idle cog to be started when COGINIT is
executed, with the number of the cog started being returned in D, and the C flag returning 0 if okay,
or 1 if no idle cog was available. At cog startup, SETCOG is initialized to %0000.

When a cog is started, $1F8 contiguous longs are read from hub memory and written to cog registers
$000..$1F7. The cog will then begin execution at $000. This process takes 1,016 clocks.

Example:

COGID COGNUM 'what cog am I?
SETCOG COGNUM 'set my cog number
COGINIT COGPGM,COGPTR 'restart me with the ROM Monitor

COGPGM LONG $0070C 'address of the ROM Monitor
COGPTR LONG 90<<9 + 91 'tx = P90, rx = P91

COGNUM RES 1

CLKSET D
---------

CLKSET writes the lower 9 bits of D to the hub clock register:

%R_MMMM_XX_SS

R = 1 for hardware reset, 0 for continued operation

MMMM = PLL mode:

%0000 for disabled, else XX must be set for XI input or XI/XO crystal oscillator
%0001 for multiply XI by 2
%0010 for multiply XI by 3
%0011 for multiply XI by 4
%0100 for multiply XI by 5
%0101 for multiply XI by 6
%0110 for multiply XI by 7
%0111 for multiply XI by 8
%1000 for multiply XI by 9
%1001 for multiply XI by 10
%1010 for multiply XI by 11
%1011 for multiply XI by 12
%1100 for multiply XI by 13
%1101 for multiply XI by 14
%1110 for multiply XI by 15
%1111 for multiply XI by 16

XX = XI/XO pin mode:

%00 for XI reads low, XO floats
%01 for XI input, XO floats
%10 for XI/XO crystal oscillator with 15pF internal loading and 1M-ohm feedback
%11 for XI/XO crystal oscillator with 30pF internal loading and 1M-ohm feedback

SS = Clock selector:

%00 for RCFAST (~20MHz)
%01 for RCSLOW (~20KHz)
%10 for XTAL (10MHz-20MHz)
%11 for PLL

Because the clock register is cleared to %0_0000_00_00 on reset, the chip starts up in RCFAST mode
with both the crystal oscillator and the PLL disabled. Before switching to XTAL or PLL mode from RCFAST
or RCSLOW, the crystal oscillator must be enabled and given 10ms to stabilize. The PLL stabilizes within
10us, so it can be enabled at the same time as the crystal oscillator. Once the crystal is stabilized, you
can switch between XTAL and RCFAST/RCSLOW without any stability concerns. If the PLL is also enabled, you
can switch freely among PLL, XTAL, and RCFAST/RCSLOW modes. You can change the PLL multiplier while
in PLL mode, but beware that some frequency overshoot and undershoot will occur as the PLL settles to its
new frequency. This only poses a hardware problem if you are switching upwards and the resulting overshoot
might exceed the speed limit of the chip.
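
Composing a CLKSET value from the %R_MMMM_XX_SS fields can be sketched as follows. `clkset_value` and its parameter names are hypothetical; the only facts used are the field order and the multiplier table above, where MMMM equals the multiplier minus one.

```python
def clkset_value(pll_multiplier=None, xx=0b00, ss=0b00, reset=False):
    """Build the 9-bit %R_MMMM_XX_SS clock register value.

    pll_multiplier: None for PLL disabled, or 2..16
    xx: XI/XO pin mode (0..3), ss: clock selector (0..3)
    """
    if pll_multiplier is None:
        mmmm = 0b0000
    elif 2 <= pll_multiplier <= 16:
        mmmm = pll_multiplier - 1       # %0001 = x2 ... %1111 = x16
    else:
        raise ValueError("PLL multiplier must be 2..16")
    if not (0 <= xx <= 3 and 0 <= ss <= 3):
        raise ValueError("xx and ss are 2-bit fields")
    return (int(reset) << 8) | (mmmm << 4) | (xx << 2) | ss


# Two-step startup: enable crystal (15pF) and PLL x8 while still on RCFAST,
# then, after the 10ms stabilization time, select the PLL.
step1 = clkset_value(pll_multiplier=8, xx=0b10, ss=0b00)
step2 = clkset_value(pll_multiplier=8, xx=0b10, ss=0b11)
print(format(step1, "09b"), format(step2, "09b"))
```

The two values differ only in SS, which matches the advice to enable the oscillator and PLL first and switch the clock selector once they are stable.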

COGID D
---------

COGID returns the number of the cog (0..7) into D.

COGSTOP D
---------

COGSTOP stops the cog specified in D (0..7).

LOCKNEW D
LOCKRET D
LOCKSET D
LOCKCLR D
---------

There are eight semaphore locks available in the chip which can be borrowed with LOCKNEW, returned with
LOCKRET, set with LOCKSET, and cleared with LOCKCLR.

While any cog can set or clear any lock without using LOCKNEW or LOCKRET, LOCKNEW and LOCKRET are provided
so that cog programs have a dynamic and simple means of acquiring and relinquishing the locks at run-time.

When a lock is set with LOCKSET, its state is set to 1 and its prior state is returned in C. LOCKCLR works
the same way, but clears the lock's state to 0. By having the hub perform the atomic operation of
setting/clearing and reporting the prior state, cogs can utilize locks to ensure that only one cog has
permission to do something at a time. If a lock starts out cleared and multiple cogs vie for the lock by
doing a 'LOCKSET locknum wc', the cog that gets C=0 back 'wins' and can have exclusive access to some shared
resource, while the other cogs get C=1 back. When the winning cog is done, it can do a 'LOCKCLR locknum' to
clear the lock and give another cog the opportunity to get C=0 back.

LOCKNEW returns the next available lock into D, with C=1 if no lock was free.

LOCKRET frees the lock in D so that it can be checked out again by LOCKNEW.

LOCKSET sets the lock in D and returns its prior state in C.

LOCKCLR clears the lock in D and returns its prior state in C.
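
The lock semantics can be modeled in software to show why the returned prior state is enough to arbitrate between cogs. This is a toy model, not hub logic: the class name is hypothetical, handing out the lowest-numbered free lock is an assumption, and so is returning `None` in place of D when no lock is free.

```python
class HubLocks:
    """Toy model of the eight hub semaphore locks."""

    def __init__(self, count=8):
        self.state = [0] * count            # current lock bits
        self.checked_out = [False] * count  # locks handed out by LOCKNEW

    def locknew(self):
        """Return (lock, C): a free lock with C=0, or (None, 1) if none is free."""
        for n, used in enumerate(self.checked_out):
            if not used:                    # assumption: lowest free lock first
                self.checked_out[n] = True
                return n, 0
        return None, 1

    def lockret(self, n):
        """Return lock n so LOCKNEW can hand it out again."""
        self.checked_out[n] = False

    def lockset(self, n):
        """Set lock n and return its prior state (the C flag)."""
        prior, self.state[n] = self.state[n], 1
        return prior

    def lockclr(self, n):
        """Clear lock n and return its prior state (the C flag)."""
        prior, self.state[n] = self.state[n], 0
        return prior


locks = HubLocks()
lock, busy = locks.locknew()
first = locks.lockset(lock)    # first cog: C=0, it wins
second = locks.lockset(lock)   # second cog: C=1, it must retry
locks.lockclr(lock)            # winner releases the lock
third = locks.lockset(lock)    # retry now sees C=0
print(lock, busy, first, second, third)
```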

instructions clocks
-------------------------------------------------------------------------------------------------
000011 ZCR 0 CCCC DDDDDDDDD SSSSSSSSS COGINIT D,S 'launch cog at D, cog PTRA = S 1..9
000011 000 1 CCCC DDDDDDDDD 000000000 CLKSET D 'set clock to D 1..8
000011 001 1 CCCC DDDDDDDDD 000000001 COGID D 'get cog number into D 2..9
000011 000 1 CCCC DDDDDDDDD 000000011 COGSTOP D 'stop cog in D 1..8
000011 ZC1 1 CCCC DDDDDDDDD 000000100 LOCKNEW D 'get new lock into D, C = busy 2..9
000011 000 1 CCCC DDDDDDDDD 000000101 LOCKRET D 'return lock in D 1..8
000011 0C0 1 CCCC DDDDDDDDD 000000110 LOCKSET D 'set lock in D, C = prev state 1..9
000011 0C0 1 CCCC DDDDDDDDD 000000111 LOCKCLR D 'clear lock in D, C = prev state 1..9
-------------------------------------------------------------------------------------------------

INDIRECT REGISTERS
------------------

Each cog has two indirect registers: INDA and INDB. They are located at $1F6 and $1F7.

By using INDA or INDB for D or S, the register pointed at by INDA or INDB is addressed.

INDA and INDB each have three hidden 9-bit registers associated with them: the pointer, the bottom limit, and
the top limit. The bottom and top limits are inclusive values which set automatic wrapping boundaries for the
pointer. This way, circular buffers can be established within cog RAM and accessed using simple INDA/INDB
references.

SETINDA/SETINDB/SETINDS are used to set or adjust the pointer value(s) while forcing the associated bottom and
top limit(s) to $000 and $1FF, respectively.

FIXINDA/FIXINDB/FIXINDS set the pointer(s) to an initial value, while setting the bottom limit(s) to the
lower of the initial and terminal values and the top limit(s) to the higher.

Because indirect addressing must occur in the 2nd stage of the pipeline, long before C and Z are valid for
conditional execution in the 4th stage, all instructions which use indirect addressing are forced to always
execute. This frees the conditional bit field (CCCC) for specifying indirect operations. The top two bits of
CCCC are used for indirect D and the bottom two bits are used for indirect S. If only D or S is indirect, the
other two bits in CCCC are ignored.

Here is the INDA/INDB usage scheme which repurposes the CCCC field:

OOOOOO ZCR I CCCC DDDDDDDDD SSSSSSSSS
-------------------------------------
xxxxxx xxx x 00xx 111110110 xxxxxxxxx D = INDA 'use INDA
xxxxxx xxx x 00xx 111110111 xxxxxxxxx D = INDB 'use INDB
xxxxxx xxx x 01xx 111110110 xxxxxxxxx D = INDA++ 'use INDA, INDA += 1
xxxxxx xxx x 01xx 111110111 xxxxxxxxx D = INDB++ 'use INDB, INDB += 1
xxxxxx xxx x 10xx 111110110 xxxxxxxxx D = INDA-- 'use INDA, INDA -= 1
xxxxxx xxx x 10xx 111110111 xxxxxxxxx D = INDB-- 'use INDB, INDB -= 1
xxxxxx xxx x 11xx 111110110 xxxxxxxxx D = ++INDA 'use INDA+1, INDA += 1
xxxxxx xxx x 11xx 111110111 xxxxxxxxx D = ++INDB 'use INDB+1, INDB += 1

xxxxxx xxx 0 xx00 xxxxxxxxx 111110110 S = INDA 'use INDA
xxxxxx xxx 0 xx00 xxxxxxxxx 111110111 S = INDB 'use INDB
xxxxxx xxx 0 xx01 xxxxxxxxx 111110110 S = INDA++ 'use INDA, INDA += 1
xxxxxx xxx 0 xx01 xxxxxxxxx 111110111 S = INDB++ 'use INDB, INDB += 1
xxxxxx xxx 0 xx10 xxxxxxxxx 111110110 S = INDA-- 'use INDA, INDA -= 1
xxxxxx xxx 0 xx10 xxxxxxxxx 111110111 S = INDB-- 'use INDB, INDB -= 1
xxxxxx xxx 0 xx11 xxxxxxxxx 111110110 S = ++INDA 'use INDA+1, INDA += 1
xxxxxx xxx 0 xx11 xxxxxxxxx 111110111 S = ++INDB 'use INDB+1, INDB += 1

If both D and S are the same indirect register, the two 2-bit fields in CCCC are OR'd together to get the
post-modifier effect:

101000 001 0 0011 111110110 111110110 MOV INDA,++INDA 'Move @INDA+1 into @INDA, INDA += 1
100000 001 0 1100 111110111 111110111 ADD ++INDB,INDB 'Add @INDB into @INDB+1, INDB += 1

Note that only '++INDx,INDx'/'INDx,++INDx' combinations can address different registers from the same INDx.

Here are the instructions which are used to set the pointer and limit values for INDA and INDB:

instructions * clocks
-------------------------------------------------------------------------------------------------
111000 000 0 0001 000000000 AAAAAAAAA SETINDA #addrA 1
111000 000 0 0011 000000000 AAAAAAAAA SETINDA ++/--deltA 1

111000 000 0 0100 BBBBBBBBB 000000000 SETINDB #addrB 1
111000 000 0 1100 BBBBBBBBB 000000000 SETINDB ++/--deltB 1

111000 000 0 0101 BBBBBBBBB AAAAAAAAA SETINDS #addrB,#addrA 1
111000 000 0 0111 BBBBBBBBB AAAAAAAAA SETINDS #addrB,++/--deltA 1
111000 000 0 1101 BBBBBBBBB AAAAAAAAA SETINDS ++/--deltB,#addrA 1
111000 000 0 1111 BBBBBBBBB AAAAAAAAA SETINDS ++/--deltB,++/--deltA 1

111001 000 0 0001 TTTTTTTTT IIIIIIIII FIXINDA #terminal,#initial 1
111001 000 0 0100 TTTTTTTTT IIIIIIIII FIXINDB #terminal,#initial 1
111001 000 0 0101 TTTTTTTTT IIIIIIIII FIXINDS #terminal,#initial 1
-------------------------------------------------------------------------------------------------
* addrA/addrB/terminal/initial = register address (0..511),
deltA/deltB = 9-bit signed delta --256..++255

Examples:

111000 000 0 0001 000000000 000000101 SETINDA #5 'INDA = 5, bottom = 0, top = 511
111000 000 0 0011 000000000 000000011 SETINDA ++3 'INDA += 3, bottom = 0, top = 511
111000 000 0 1100 111111100 000000000 SETINDB --4 'INDB -= 4, bottom = 0, top = 511
111000 000 0 0111 000000111 000001000 SETINDS #7,++8 'INDB = 7, INDA += 8, bottoms = 0, tops = 511

111001 000 0 0001 000001111 000001000 FIXINDA #15,#8 'INDA = 8, bottom = 8, top = 15
111001 000 0 0100 000010000 000011111 FIXINDB #16,#31 'INDB = 31, bottom = 16, top = 31
111001 000 0 0101 001100011 000110010 FIXINDS #99,#50 'INDA/INDB = 50, bottoms = 50, tops = 99
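
The pointer and limit behavior can be sketched in software to show how FIXINDx sets up a circular buffer. This is a rough model with hypothetical names. It assumes that stepping past the top limit wraps to the bottom limit and that stepping below the bottom wraps to the top; the text implies this but does not spell it out.

```python
class IndirectPointer:
    """Toy model of one indirect register's pointer, bottom and top limits."""

    def __init__(self):
        self.pointer, self.bottom, self.top = 0, 0x000, 0x1FF

    def set(self, address):
        """SETINDx #addr: load the pointer, force limits to $000/$1FF."""
        self.pointer = address & 0x1FF
        self.bottom, self.top = 0x000, 0x1FF

    def adjust(self, delta):
        """SETINDx ++/--delta: move the pointer, force limits to $000/$1FF."""
        self.pointer = (self.pointer + delta) & 0x1FF
        self.bottom, self.top = 0x000, 0x1FF

    def fix(self, terminal, initial):
        """FIXINDx #terminal,#initial: limits are the lower/higher of the two."""
        self.pointer = initial
        self.bottom, self.top = min(terminal, initial), max(terminal, initial)

    def step(self, direction):
        """INDx++ (direction=+1) or INDx-- (direction=-1): return the address
        used, then post-modify the pointer with wrapping (assumed behavior)."""
        used = self.pointer
        if direction > 0:
            self.pointer = self.bottom if used == self.top else used + 1
        else:
            self.pointer = self.top if used == self.bottom else used - 1
        return used


inda = IndirectPointer()
inda.fix(15, 8)                                   # FIXINDA #15,#8
addresses = [inda.step(+1) for _ in range(10)]    # ten INDA++ accesses
print(addresses)
```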

STACK RAM
---------

Each cog has a 256-long stack RAM that is accessible via push and pop operations. Its contents
are not initialized at either reset or cog startup. So, at cog startup, it will contain whatever
it happened to power up with, or whatever was last written.

There are two stack pointers called SPA and SPB which are used to address the stack memory. Aside
from automatically incrementing and decrementing via pushes and pops, SPA and SPB can be set,
modified, read back, and checked:

SETSPA D/#n set SPA
SETSPB D/#n set SPB
ADDSPA D/#n add to SPA
ADDSPB D/#n add to SPB
SUBSPA D/#n subtract from SPA
SUBSPB D/#n subtract from SPB
GETSPA D get SPA, SPA==0 into Z, SPA.7 into C
GETSPB D get SPB, SPB==0 into Z, SPB.7 into C
GETSPD D get SPA minus SPB, SPA==SPB into Z, SPA<SPB into C
CHKSPA check SPA, SPA==0 into Z, SPA.7 into C
CHKSPB check SPB, SPB==0 into Z, SPB.7 into C
CHKSPD check SPA minus SPB, SPA==SPB into Z, SPA<SPB into C

Data can be pushed and popped in both normal and reverse directions:

PUSHA D/#n push using SPA
PUSHB D/#n push using SPB
PUSHAR D/#n push using SPA, use pop addressing
PUSHBR D/#n push using SPB, use pop addressing
POPA D pop using SPA
POPB D pop using SPB
POPAR D pop using SPA, use push addressing
POPBR D pop using SPB, use push addressing

Aside from data, the program counter and flags can be pushed and popped using calls and returns:

CALLA D/#n call using SPA, zeros/Z/C/PC+1 are written @SPA, SPA += 1
CALLB D/#n call using SPB, zeros/Z/C/PC+1 are written @SPB, SPB += 1
RETA return using SPA, Z/C/PC are read @SPA-1, SPA -= 1, if WZ/WC then Z/C updated
RETB return using SPB, Z/C/PC are read @SPB-1, SPB -= 1, if WZ/WC then Z/C updated

CALLAD/CALLBD/RETAD/RETBD are delayed versions of CALLA/CALLB/RETA/RETB.

SPA and SPB are both initialized to 0 at cog startup.

instructions (stack RAM access is shown as [SPx++] and [--SPx]) clocks
-------------------------------------------------------------------------------------------------
000011 ZC0 1 CCCC 000000000 000010101 CHKSPD 'SPA==SPB into Z, SPA<SPB into C 1
000011 ZC1 1 CCCC DDDDDDDDD 000010101 GETSPD D 'SPA-SPB into D, Z/C as CHKSPD 1

000011 ZC0 1 CCCC 000000000 000010110 CHKSPA 'SPA==0 into Z, SPA.7 into C 1
000011 ZC1 1 CCCC DDDDDDDDD 000010110 GETSPA D 'SPA into D, Z/C as CHKSPA 1

000011 ZC0 1 CCCC 000000000 000010111 CHKSPB 'SPB==0 into Z, SPB.7 into C 1
000011 ZC1 1 CCCC DDDDDDDDD 000010111 GETSPB D 'SPB into D, Z/C as CHKSPB 1

000011 ZC1 1 CCCC DDDDDDDDD 000011000 POPAR D 'read [SPA++] into D, MSB into C 1
000011 ZC1 1 CCCC DDDDDDDDD 000011001 POPBR D 'read [SPB++] into D, MSB into C 1

000011 ZC1 1 CCCC DDDDDDDDD 000011010 POPA D 'read [--SPA] into D, MSB into C 1
000011 ZC1 1 CCCC DDDDDDDDD 000011011 POPB D 'read [--SPB] into D, MSB into C 1

000011 ZC0 1 CCCC 000000000 000011100 RETA 'read [--SPA] into Z/C/PC* 4
000011 ZC0 1 CCCC 000000000 000011101 RETB 'read [--SPB] into Z/C/PC* 4

000011 ZC0 1 CCCC 000000000 000011110 RETAD 'read [--SPA] into Z/C/PC* 1
000011 ZC0 1 CCCC 000000000 000011111 RETBD 'read [--SPB] into Z/C/PC* 1

000011 000 1 CCCC DDDDDDDDD 010100010 SETSPA D 'set SPA to D 1
000011 001 1 CCCC 0nnnnnnnn 010100010 SETSPA #n 'set SPA to n 1
000011 000 1 CCCC DDDDDDDDD 010100011 SETSPB D 'set SPB to D 1
000011 001 1 CCCC 0nnnnnnnn 010100011 SETSPB #n 'set SPB to n 1

000011 000 1 CCCC DDDDDDDDD 010100100 ADDSPA D 'add D into SPA 1
000011 001 1 CCCC 0nnnnnnnn 010100100 ADDSPA #n 'add n into SPA 1
000011 000 1 CCCC DDDDDDDDD 010100101 ADDSPB D 'add D into SPB 1
000011 001 1 CCCC 0nnnnnnnn 010100101 ADDSPB #n 'add n into SPB 1

000011 000 1 CCCC DDDDDDDDD 010100110 SUBSPA D 'subtract D from SPA 1
000011 001 1 CCCC 0nnnnnnnn 010100110 SUBSPA #n 'subtract n from SPA 1
000011 000 1 CCCC DDDDDDDDD 010100111 SUBSPB D 'subtract D from SPB 1
000011 001 1 CCCC 0nnnnnnnn 010100111 SUBSPB #n 'subtract n from SPB 1

000011 000 1 CCCC DDDDDDDDD 010101000 PUSHAR D 'write D into [--SPA] 1 **
000011 001 1 CCCC nnnnnnnnn 010101000 PUSHAR #n 'write n into [--SPA] 1 **
000011 000 1 CCCC DDDDDDDDD 010101001 PUSHBR D 'write D into [--SPB] 1 **
000011 001 1 CCCC nnnnnnnnn 010101001 PUSHBR #n 'write n into [--SPB] 1 **

000011 000 1 CCCC DDDDDDDDD 010101010 PUSHA D 'write D into [SPA++] 1 **
000011 001 1 CCCC nnnnnnnnn 010101010 PUSHA #n 'write n into [SPA++] 1 **
000011 000 1 CCCC DDDDDDDDD 010101011 PUSHB D 'write D into [SPB++] 1 **
000011 001 1 CCCC nnnnnnnnn 010101011 PUSHB #n 'write n into [SPB++] 1 **

000011 000 1 CCCC DDDDDDDDD 010101100 CALLA D 'write Z/C/PC* into [SPA++], PC=D 4 **
000011 001 1 CCCC nnnnnnnnn 010101100 CALLA #n 'write Z/C/PC* into [SPA++], PC=n 4 **
000011 000 1 CCCC DDDDDDDDD 010101101 CALLB D 'write Z/C/PC* into [SPB++], PC=D 4 **
000011 001 1 CCCC nnnnnnnnn 010101101 CALLB #n 'write Z/C/PC* into [SPB++], PC=n 4 **

000011 000 1 CCCC DDDDDDDDD 010101110 CALLAD D 'write Z/C/PC* into [SPA++], PC=D 1 **
000011 001 1 CCCC nnnnnnnnn 010101110 CALLAD #n 'write Z/C/PC* into [SPA++], PC=n 1 **
000011 000 1 CCCC DDDDDDDDD 010101111 CALLBD D 'write Z/C/PC* into [SPB++], PC=D 1 **
000011 001 1 CCCC nnnnnnnnn 010101111 CALLBD #n 'write Z/C/PC* into [SPB++], PC=n 1 **
-------------------------------------------------------------------------------------------------
* bit 10 is Z, bit 9 is C, bits 8..0 are PC, upper bits are ignored or cleared
** if a stack RAM write is immediately followed by a stack RAM read, add one clock

BYTE/WORD FIELD MOVER
---------------------

Each cog has a field mover that can move a byte or word from any field in S into any field in D. To use
the field mover, you must first configure it using SETF. Then, you can use MOVF to perform the moves.

SETF uses a 9-bit value to configure the field mover:

%W_DDdd_SSss

W = 1 for word mode, 0 for byte mode

DD = D field mode:     %00 = D field pointer stays same after MOVF
                       %01 = D field pointer stays same after MOVF, D rotates left by byte/word
                       %10 = D field pointer increments after MOVF
                       %11 = D field pointer decrements after MOVF

dd = D field pointer:  %00 = byte 0 / word 0
                       %01 = byte 1 / word 0
                       %10 = byte 2 / word 1
                       %11 = byte 3 / word 1

SS = S field mode:     %0x = S field pointer stays same after MOVF
                       %10 = S field pointer increments after MOVF
                       %11 = S field pointer decrements after MOVF

ss = S field pointer:  %00 = byte 0 / word 0
                       %01 = byte 1 / word 0
                       %10 = byte 2 / word 1
                       %11 = byte 3 / word 1

On cog startup, SETF is initialized to %0_0100_0000, so that MOVF will rotate D left by 8 bits and
then fill the bottom byte with the lower byte in S.
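
The bit layout above can be checked with a short host-side model. The Python sketch below is not part of
any P2 tool; the names decode_setf and movf_default are hypothetical, and movf_default models only the
power-up configuration described above (D rotated left by 8 bits, bottom byte replaced by the low byte
of S). Other SETF modes are not modelled.

```python
def decode_setf(cfg):
    """Split a 9-bit SETF value %W_DDdd_SSss into its fields."""
    if not 0 <= cfg <= 0x1FF:
        raise ValueError("SETF value must fit in 9 bits")
    return {
        "W": (cfg >> 8) & 1,    # 1 = word mode, 0 = byte mode
        "DD": (cfg >> 6) & 3,   # D field mode
        "dd": (cfg >> 4) & 3,   # D field pointer
        "SS": (cfg >> 2) & 3,   # S field mode
        "ss": cfg & 3,          # S field pointer
    }


def movf_default(d, s):
    """Model MOVF under the power-up setting %0_0100_0000 only (assumed behavior)."""
    rotated = ((d << 8) | (d >> 24)) & 0xFFFFFFFF   # rotate D left by one byte
    return (rotated & 0xFFFFFF00) | (s & 0xFF)      # low byte of S replaces byte 0


print(decode_setf(0b0_0100_0000))
print(hex(movf_default(0x11223344, 0xAABBCCDD)))
```

Under this model, four MOVF operations in a row would assemble four source bytes into one long, most
significant byte first.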

instructions clocks
-------------------------------------------------------------------------------------------------
000011 000 1 CCCC DDDDDDDDD 011001010 SETF D 'Configure field mover with D 1
000011 001 1 CCCC nnnnnnnnn 011001010 SETF #n 'Configure field mover with 0..511 1
000101 000 0 CCCC DDDDDDDDD SSSSSSSSS MOVF D,S 'Move field from S into D 1
000101 000 1 CCCC DDDDDDDDD nnnnnnnnn MOVF D,#n 'Move field from 0..511 into D 1
-------------------------------------------------------------------------------------------------

MULTI-TASKING
-------------

Each cog has four sets of flags and program counters (Z/C/PC), constituting four unique tasks that
can execute and switch on each instruction cycle.

At cog startup, the tasks are initialized as follows:

task Z C PC
---------------
0 0 0 $000
1 0 0 $001
2 0 0 $002
3 0 0 $003

There are 16 rotating time slots in the TASK register that determine task sequence. Initially, all
time slots are set to 0, causing task 0 to execute exclusively, starting at address $000:

time slots:      15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
                  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
TASK register:  %00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00

The two LSB's of TASK always determine which task will execute next. After each instruction cycle,
the TASK register is rotated right by two bits, recycling slot 0 to slot 15 and getting the next task
into the 2 LSB's.

To enable other tasks, SETTASK is used to set the TASK register:

SETTASK D write D to the TASK register
SETTASK #n write {n[7:0], n[7:0], n[7:0], n[7:0]} to the TASK register
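
The rotation and the immediate form of SETTASK can be modelled on a host machine. This Python sketch is
only an illustration of the rules stated above; settask_immediate and task_sequence are hypothetical
helper names, not P2 instructions or tool functions.

```python
MASK32 = 0xFFFFFFFF


def settask_immediate(n):
    """SETTASK #n: replicate n[7:0] four times into the 32-bit TASK register."""
    byte = n & 0xFF
    return byte * 0x01010101


def task_sequence(task_reg, slots):
    """Return the task numbers executed over the given number of time slots."""
    order = []
    for _ in range(slots):
        order.append(task_reg & 3)                                  # two LSBs pick the task
        task_reg = ((task_reg >> 2) | (task_reg << 30)) & MASK32    # rotate right by 2 bits
    return order


task = settask_immediate(0b11_10_01_00)   # same 8-bit pattern as #%%3210
print(hex(task), task_sequence(task, 8))
```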

If a task is given no time slot, it doesn't execute and its flags and PC stay at initial values. If a
task is given a time slot, it will execute and its flags and PC will be updated at every instruction,
or time slot. If an active task's time slots are all taken away, that task's flags and PC remain in the
state where they left off, until it is given another time slot.

To immediately force any of the four PC's to a new address, JMPTASK can be used. JMPTASK uses a 4-bit
mask to select which PC's are going to be written. Mask bits 0..3 represent PC's 0..3. The mask value
%1010 would write PC 3 and PC 1, while %0100 would write PC 2, only.

JMPTASK D,#mask force PC's in mask to D
JMPTASK #addr,#mask force PC's in mask to #addr
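
As a quick check of the mask convention, the following Python sketch applies a JMPTASK-style mask to a
list of four program counters. The function name jmptask is a hypothetical stand-in for the hardware
behavior, and the model covers only the PC update, not the pipeline cancellation.

```python
def jmptask(pcs, addr, mask):
    """Return a new list of four PCs, with those selected by mask forced to addr."""
    return [addr & 0x1FF if (mask >> t) & 1 else pc for t, pc in enumerate(pcs)]


print(jmptask([0, 1, 2, 3], 0x100, 0b1010))   # mask bits 1 and 3 are set
print(jmptask([0, 1, 2, 3], 0x100, 0b0100))   # only mask bit 2 is set
```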

For every PC/task affected by a JMPTASK instruction, all affected-task instructions currently in the
pipeline are cancelled. This ensures that once JMPTASK executes, the next instruction from each
affected task will be from the new address.

Here is an example in which all four tasks are started and each task toggles an I/O pin at a
different rate:

ORG

JMP #task0 'task 0 begins here when the cog starts (this JMP takes 4 clocks)
JMP #task1 'task 1 begins here after task 0 executes SETTASK (this JMP takes 1 clock)
JMP #task2 'task 2 begins here after task 0 executes SETTASK (this JMP takes 1 clock)
JMP #task3 'task 3 begins here after task 0 executes SETTASK (this JMP takes 1 clock)

task0 SETTASK #%%3210 'enable all tasks (TASK = %11_10_01_00_11_10_01_00_11_10_01_00_11_10_01_00)

:loop NOTP #0 'task 0, toggle pin 0 - loops every 8 clocks
JMP #:loop '(this JMP takes 1 clock)

task1 NOTP #1 'task 1, toggle pin 1 - loops every 12 clocks
NOP
JMP #task1 '(this JMP takes 1 clock)

task2 NOTP #2 'task 2, toggle pin 2 - loops every 16 clocks
NOP
NOP
JMP #task2 '(this JMP takes 1 clock)

task3 NOTP #3 'task 3, toggle pin 3 - loops every 20 clocks
NOP
NOP
NOP
JMP #task3 '(this JMP takes 1 clock)

------------------------------------------------------------------------------------------------------------
NOTE: When a normal branch instruction (JMP, CALL, RET, etc.) executes in the 4th and final stage of the
pipeline, all instructions progressing through the lower three stages, which belong to the same task as the
branch instruction, are cancelled. This inhibits execution of incidental data that was trailing the branch
instruction.

The delayed branch instructions (JMPD, CALLD, RETD, etc.) don't do any pipeline instruction cancellation and
exist to provide 1-clock branches to single-task programs, where the three instructions following the branch
are allowed to execute before the new instruction stream begins to execute.

For single-task programs, normal branches take 4 clocks: 1 clock for the branch and 3 clocks for the
cancelled instructions to come through the pipeline before the new instruction stream begins to execute.

For multi-tasking programs that use all four tasks in sequence (i.e. SETTASK #%%3210), there are never any
same-task instructions in the pipeline that would require cancellation due to branching, so all branches
take just 1 clock.
------------------------------------------------------------------------------------------------------------
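
The branch timing in the note above can be estimated from the TASK register. The Python sketch below
assumes that the three pipeline stages behind a branch hold the instructions of the next three time
slots; that is a simplification made for illustration, and next_slots and branch_clocks are hypothetical
names.

```python
MASK32 = 0xFFFFFFFF


def next_slots(task_reg, count):
    """Task numbers for the next `count` time slots, starting with slot 0."""
    tasks = []
    for _ in range(count):
        tasks.append(task_reg & 3)
        task_reg = ((task_reg >> 2) | (task_reg << 30)) & MASK32
    return tasks


def branch_clocks(task_reg):
    """Clocks for a normal branch executed by the task that owns slot 0."""
    tasks = next_slots(task_reg, 4)
    branching, trailing = tasks[0], tasks[1:]
    # 1 clock for the branch plus 1 per cancelled same-task instruction
    return 1 + sum(1 for t in trailing if t == branching)


print(branch_clocks(0x00000000))   # single task
print(branch_clocks(0xE4E4E4E4))   # four tasks in sequence (%%3210)
print(branch_clocks(0x44444444))   # two tasks alternating (%%1010)
```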

Tips for coding multi-tasking programs
--------------------------------------

While all tasks in a multi-tasking program can execute atomic instructions without any inter-task conflict,
remember that there's only one of each of the following cog resources and only one task can use it at a time:

SPA
SPB
INDA
INDB
PTRA
PTRB
ACCA
ACCB
32x32 multiplier
64/32 divider
64-bit square rooter
CORDIC computer
CTRA
CTRB
VID
PIX (not usable in multi-tasking, requires single-task timing)
XFR
SER
REPS/REPD
SETF/MOVF

When writing multi-task programs, be aware that instructions that take multiple clocks will stall the
pipeline and have a ripple effect on the tasks' timing. This may be impossible to avoid, as some task
might need to access hub memory, and those instructions are not single-clock.

The WAITCNT/WAITPEQ/WAITPNE instructions should be recoded discretely using 1-clock instructions, to
avoid stalling the pipeline for excessive amounts of time.

The following instructions (WC versions) will take 1 clock, instead of potentially many, and return 1 in
C if they were successful:

SNDSER D WC attempt to send serial
RCVSER D WC attempt to receive serial
GETMULL D WC attempt to get lower multiplier result
GETMULH D WC attempt to get upper multiplier result
GETDIVQ D WC attempt to get divider quotient result
GETDIVR D WC attempt to get divider remainder result
GETSQRT D WC attempt to get square root result
GETQX D WC attempt to get CORDIC X result
GETQY D WC attempt to get CORDIC Y result
GETQZ D WC attempt to get CORDIC Z result

Other instruction alternatives:

POLCTRA WC returns 1 in C if CTRA rolled over, use instead of SYNCTRA
POLCTRB WC returns 1 in C if CTRB rolled over, use instead of SYNCTRB
POLVID WC returns 1 in C if WAITVID is ready, use to execute WAITVID without stalling
PASSCNT D jumps to itself if some amount of time has not passed, use instead of WAITCNT
JP/JNP D,S jumps based on pin states, use instead of WAITPEQ/WAITPNE
DJNZ D,#$ loops until done, use instead of NOP D/#n

The following instruction will not work in a multi-tasking program:

GETPIX needs steady pipeline delays for perspective divider time - single-task only

instructions clocks
-------------------------------------------------------------------------------------------------
000011 000 1 CCCC DDDDDDDDD 01001mmmm JMPTASK D,#mask 'Set PC's in mask to D 1
000011 001 1 CCCC nnnnnnnnn 01001mmmm JMPTASK #n,#mask 'Set PC's in mask to 0..511 1

000011 000 1 CCCC DDDDDDDDD 011001011 SETTASK D 'Set TASK to D 1
000011 001 1 CCCC 0nnnnnnnn 011001011 SETTASK #n 'Set TASK to n[7:0] copied 4x 1
-------------------------------------------------------------------------------------------------

PIPELINE
--------

Each cog has a 4-stage pipeline which all instructions progress through, in order to execute:

1st stage - Read instruction from cog register RAM

2nd stage - Determine any indirect or remapped D and S addresses within instruction
            Update INDA and INDB

3rd stage - Read D and S from cog register RAM

4th stage - Execute instruction using D and S
            Write any D result to cog register RAM
            Update Z/C/PC and any other results

On every clock cycle, the instruction data in each stage advances to the next stage, unless the instruction
in the 4th stage is stalling the pipeline because it's waiting for something (e.g. WRBYTE waits for the hub).

To keep D and S data current within the pipeline, the resultant D from the 4th stage is passed back to
the 3rd stage to substitute for any obsoleted D or S data currently being read from the cog register RAM.
The same is done for instruction data currently being read in the 1st stage, but this still leaves a two-
stage gap between when a register is modified and when it can be executed:

'single-task self-modifying code

MOVI :inst,top9 '(initially 4th stage) modify instruction
NOP '(initially 3rd stage) 1...
NOP '(initially 2nd stage) 2... at least two instructions in-between
:inst ADD A,B '(initially 1st stage) modified instruction properly executes

Tasks that execute no more frequently than every 3rd time slot don't need to observe this 2-instruction
spacer rule when executing self-modifying code, because their instructions will always be sufficiently spread
apart in the pipeline by other tasks' instructions, enabling a just-modified instruction to be properly read
and executed in that task's next time slot. If fewer than two spacers are afforded to a modify-execute sequence,
the old instruction will be read and executed, instead of the newly-modified one. This can be used to advantage
for efficient overlapped modify-execute sequences.

When a branch instruction executes, that task's program counter is abruptly changed from what had been a
steadily incrementing course, requiring that the pipeline be reloaded, beginning at the new program counter
address. This can leave up to three instructions in the pipeline which were trailing the branch instruction
and belong to the same task as the branch.

Normally, these trailing instructions are incidental data which are not intended for execution, and therefore
must be cancelled within the pipeline, so that they pass through without doing anything. However, in the case
of a single-task program, it may be desirable to allow those instructions to execute, without cancellation, to
increase pipeline efficiency. This will result in the branch taking just 1 clock cycle, but three trailing
instructions will be executed before the branch appears to take effect:

'single-task delayed branch

JMPD #somewhere '(initially 4th stage) do a delayed jmp, then toggle P0 and cycle P1
NOTP #0 '(initially 3rd stage)
NOTP #1 '(initially 2nd stage)
NOTP #1 '(initially 1st stage) next instruction is loaded from 'somewhere'

To accommodate both cancelling and non-cancelling branches, branch instructions have two versions. The ones
that end in the letter 'D' for 'delayed' are non-cancelling and take only one clock, and are intended only for
use in single-task programs.

The branch instructions that don't end in the letter 'D' are what would be considered 'normal' branches, as
they cancel any same-task instructions in the pipeline, so that the next instruction to execute after the
branch would be the instruction which was branched to.

Here are all the branching instructions:

'normal' 'delayed'
cancelling non-cancelling
---------- --------------
JMP JMPD jump to address
CALL CALLD call subroutine
RET RETD return from subroutine
JMPRET JMPRETD general case branch instruction
TASKSW TASKSWD switch between threads
CALLA CALLAD call using stack @SPA
CALLB CALLBD call using stack @SPB
RETA RETAD return using stack @SPA
RETB RETBD return using stack @SPB
IJZ IJZD increment D and jump if result zero
IJNZ IJNZD increment D and jump if result not zero
DJZ DJZD decrement D and jump if result zero
DJNZ DJNZD decrement D and jump if result not zero
TJZ TJZD test D and jump if result zero
TJNZ TJNZD test D and jump if result not zero
JP JPD jump if pin D reads high
JNP JNPD jump if pin D reads low

PASSCNT loop until CNTL passes D
JMPTASK jump selected tasks to address

INSTRUCTION-BLOCK REPEATING
---------------------------

Each cog has an instruction-block repeater that can variably repeat up to 64 instructions without
any clock-cycle overhead.

REPD and REPS are used to initiate block repeats. These instructions specify how many times the
trailing instruction block will be executed and how many instructions are in the block:

REPD #i - execute 1..64 instructions infinitely, requires 3 spacer instructions *
REPD D,#i - execute 1..64 instructions D+1 times, requires 3 spacer instructions *
REPD #n,#i - execute 1..64 instructions 1..512 times, requires 3 spacer instructions *

REPS #n,#i - execute 1..64 instructions 1..16384 times, requires 1 spacer instruction *

REPS differs from REPD by executing at the 2nd stage of the pipeline, instead of the 4th. By
executing two stages earlier, it needs only one spacer instruction *. Because of its earliness,
no conditional execution is possible, so it is forced to always execute, allowing the CCCC bits
to be repurposed, along with Z, to provide a 14-bit constant for the repeat count.

The instruction-block repeater will quit repeating the block if a branch instruction executes
within the block. Care must be taken, though, if using JMPTASK to affect a task which may be
using the block repeater, as it will not cancel the block repeater. To get around this, the block
repeater can be benignly reassigned to the task doing the JMPTASK, before the JMPTASK executes:

REPS #1,#1 'effectively cancel the block repeater
JMPTASK D/#n,#mask 'now do the JMPTASK

* Spacer instructions are required in 1-task applications to allow the pipeline to prime before
repeating can commence. If REPD is used by a task that uses no more than every 4th time slot, no
spacers are needed, as three intervening instructions will be provided by the other task(s). If
REPS is used by a task that uses no more than every 2nd time slot, no spacers are needed.
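
Whether a given task can omit the spacers can be worked out from the TASK register. The Python sketch
below encodes only the two thresholds stated in the footnote (every 4th slot for REPD, every 2nd slot
for REPS); min_slot_gap and needs_spacers are hypothetical helper names, and the 16-slot ring is assumed
to be static while the repeat is set up.

```python
def min_slot_gap(task_reg, task):
    """Smallest distance, in time slots, between two turns of `task` in the 16-slot ring."""
    slots = [(task_reg >> (2 * i)) & 3 for i in range(16)]
    turns = [i for i, t in enumerate(slots) if t == task]
    if not turns:
        return None                                   # task has no time slot at all
    gaps = []
    for k, pos in enumerate(turns):
        nxt = turns[(k + 1) % len(turns)]
        gaps.append((nxt - pos) % 16 or 16)           # a lone slot recurs every 16 slots
    return min(gaps)


def needs_spacers(task_reg, task, instr):
    """True if `task` must supply its own spacer instructions after REPD or REPS."""
    threshold = {"REPD": 4, "REPS": 2}[instr]
    gap = min_slot_gap(task_reg, task)
    if gap is None:
        raise ValueError("task has no time slot")
    return gap < threshold


print(needs_spacers(0x00000000, 0, "REPD"))   # single task
print(needs_spacers(0xE4E4E4E4, 0, "REPD"))   # every 4th slot
print(needs_spacers(0x44444444, 0, "REPS"))   # every 2nd slot
```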

Example (1-task):

REPD D,#1 'execute 1 instruction D+1 times

NOP '3 spacer instructions needed (could do something useful)
NOP
NOP

NOTP #0 'toggle P0, block repeats every 1 clock

Example (1-task):

REPS #10_000,#4 'execute 4 instructions 10,000 times

NOP '1 spacer instruction needed (make the most of it)

NOTP #0 'toggle P0
NOTP #1 'toggle P1
NOTP #2 'toggle P2
NOTP #3 'toggle P3, block repeats every 4 clocks

Example (4-task, SETTASK #%%3210 timing):

task0 REPD #1 'task0 will own the block repeater (no need for spacers)
NOTP #0 'toggle P0 every 4 clocks

task1 NOTP #1 'toggle P1 every 8 clocks
JMP #task1

task2 NOTP #2 'toggle P2 every 8 clocks
JMP #task2

task3 NOTP #3 'toggle P3 every 8 clocks
JMP #task3

instructions (iiiiii = #i-1, nnnnnnnnn/n___nnnn_nnnnnnnnn = #n-1) clocks
----------------------------------------------------------------------------------------------------
000011 000 1 CCCC 111111111 001iiiiii REPD #i 'execute 1..64 inst's infinitely 1
000011 000 1 CCCC DDDDDDDDD 001iiiiii REPD D,#i 'execute 1..64 inst's D+1 times 1
000011 001 1 CCCC nnnnnnnnn 001iiiiii REPD #n,#i 'execute 1..64 inst's 1..512 times 1
000011 n11 1 nnnn nnnnnnnnn 001iiiiii REPS #n,#i 'execute 1..64 inst's 1..16384 times 1
----------------------------------------------------------------------------------------------------
Note that the %iiiiii field represents 1..64 instructions, not the encoded 0..63. The %nnnnnnnnn/
%n___nnnn_nnnnnnnnn fields are +1-based, too.
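
To make the REPS field packing concrete, here is a host-side Python sketch that builds the 32-bit
instruction word from the table row above. It assumes the 14-bit count is laid out most significant bit
first across the Z, CCCC and D positions, as the %n___nnnn_nnnnnnnnn notation suggests; encode_reps is a
hypothetical name and this is not an assembler.

```python
def encode_reps(n, i):
    """Build the word for REPS #n,#i (n = 1..16384 repeats, i = 1..64 instructions)."""
    if not 1 <= n <= 16384:
        raise ValueError("repeat count must be 1..16384")
    if not 1 <= i <= 64:
        raise ValueError("block size must be 1..64")
    count = n - 1                          # fields hold #n-1 and #i-1
    word = 0b000011 << 26                  # opcode bits 31..26
    word |= ((count >> 13) & 1) << 25      # Z position holds count bit 13
    word |= 0b11 << 23                     # C and R positions are fixed at 1
    word |= 1 << 22                        # I position is fixed at 1
    word |= ((count >> 9) & 0xF) << 18     # CCCC positions hold count bits 12..9
    word |= (count & 0x1FF) << 9           # D field holds count bits 8..0
    word |= (0b001 << 6) | (i - 1)         # S field: %001 prefix plus block size - 1
    return word


print(f"{encode_reps(1, 1):032b}")
print(f"{encode_reps(16384, 64):032b}")
```

The range checks reflect the 1..16384 and 1..64 limits given in the table; larger counts would need
REPD with a register operand or nested looping.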

HUB COUNTER
-----------

The hub contains a 64-bit counter called CNT that increments on each clock cycle. Each cog can use CNT
to mark time in various ways. On chip reset, the ROM Booter initializes CNT to $00000000_00000000. For
the purpose of describing the cog instructions which relate to CNT, the lower long of CNT is alternately
called CNTL and the upper long, delayed by one clock cycle, is called CNTH. The one-clock delay of CNTH
enables proper reading of the entire CNT value when two instructions must be used in sequence to access
its bottom and top longs.

Here are the instructions which relate to CNT:

GETCNT D Get CNTL into D. If another GETCNT is executed in the next clock cycle by the
same task, it gets CNTH into D.

SUBCNT D Get CNTL minus D into D. If another SUBCNT is executed in the next clock cycle
by the same task, it gets CNTH minus D minus carry from previous SUBCNT into D.
In either case, the logical not of the MSB of the D result (not the carry) goes
into C, indicating by C=1 if CNTL (or CNT) has exceeded the original D value(s).

CMPCNT D Same as SUBCNT, but doesn't store the D result(s). Useful for periodic checking
if a time target has been reached yet.

PASSCNT D Jump to self if MSB of CNTL minus D is 1. In other words, loop until CNTL
exceeds D. This is intended as a non-pipeline-stalling alternative to WAITCNT,
for use in multi-task programs.

WAITCNT D,S/#n Wait for CNTL to be equal to D. Adds S/#n into D.

WAITCNT D,S/#n WC Wait for CNT to be equal to the concatenation of the last-written D value and
the D expressed in the WAITCNT. Adds S/#n into D. Carry from the addition goes
into C.

WAITPEQ D,S/#n WC Like WAITPEQ without WC, except the last-written D value becomes a CNTL timeout
target, with C returning 0 if the WAITPEQ condition was met, or 1 if the timeout
occurred first.

WAITPNE D,S/#n WC Like WAITPNE without WC, except the last-written D value becomes a CNTL timeout
target, with C returning 0 if the WAITPNE condition was met, or 1 if the timeout
occurred first.
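
The C result of SUBCNT and CMPCNT can be reproduced with 32-bit modular arithmetic. The Python sketch
below follows the rule given above (C is the logical not of the MSB of CNTL minus D); cnt_passed is a
hypothetical helper name. It shows why the comparison keeps working when CNTL wraps around, provided the
target is less than 2^31 clocks ahead.

```python
MASK32 = 0xFFFFFFFF


def cnt_passed(cntl, target):
    """C as reported by SUBCNT/CMPCNT: NOT of the MSB of (CNTL - D), modulo 2^32."""
    diff = (cntl - target) & MASK32
    return (diff >> 31) == 0


start = 0xFFFFFFF0
target = (start + 0x20) & MASK32          # target wraps around to $00000010
print(cnt_passed(0xFFFFFFF8, target))     # before the target
print(cnt_passed(0x00000018, target))     # after the target, past the wrap
```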

Examples:

'Measure time using lower 32 bits of CNT

GETCNT ticks 'get CNTL into ticks
<somecode> 'execute some code
SUBCNT ticks 'get CNTL minus ticks into ticks, <somecode> took ticks-1 clocks to execute

'Measure time using full 64 bits of CNT (single task)

GETCNT ticks_low 'get CNT into {ticks_high, ticks_low}
GETCNT ticks_high
<somecode> 'execute some code
SUBCNT ticks_low 'get CNT minus {ticks_high, ticks_low} into {ticks_high, ticks_low}
SUBCNT ticks_high '<somecode> took {ticks_high, ticks_low}-1 clocks to execute

'Do something for some time

GETCNT ticks 'get CNTL
ADD ticks,#500 'add 500

loop <somecode> 'execute some code
CMPCNT ticks WC 'check if 500 clocks have elapsed yet
if_nc JMP #loop 'if not, loop

'Do something every Nth clock (multi-task)

GETCNT ticks 'get CNTL

loop ADD ticks,#500 'add 500
PASSCNT ticks 'wait for next 500th clock
<somecode> 'execute some code
JMP #loop 'loop

'Do something every Nth clock (single-task)

GETCNT ticks 'get CNTL
ADD ticks,#500 'add initial 500

loop WAITCNT ticks,#500 'wait for next 500th clock, add next 500
<somecode> 'execute some code
JMP #loop 'loop

'Wait for pins to equal a value, with time-out

GETCNT ticks 'get CNTL
ADD ticks,#200 'allow 200 clock cycles for WAITPEQ (CNTL target is last-stored value)
WAITPEQ value,mask WC 'wait for (pins & mask) = value
if_c JMP #timeout 'if C=1 then timeout occurred, else pin condition was met

instructions clocks
----------------------------------------------------------------------------------------------------
000011 ZC0 1 CCCC DDDDDDDDD 000001100 CMPCNT D 'compares D to CNTL, C = CNTL has passed D 1
000011 ZC1 1 CCCC DDDDDDDDD 000001100 SUBCNT D 'subtracts D from CNTL, then CNTH 1
000011 000 1 CCCC DDDDDDDDD 000001101 PASSCNT D 'loops until CNTL passes D 1*
000011 001 1 CCCC DDDDDDDDD 000001101 GETCNT D 'gets CNTL, then CNTH 1
111111 0CR I CCCC DDDDDDDDD SSSSSSSSS WAITCNT D,S 'wait for CNTL or CNT (WC), D += S ?
111111 110 I CCCC DDDDDDDDD SSSSSSSSS WAITPEQ D,S WC 'wait for (pins & S) = D, do timeout ?
111111 111 I CCCC DDDDDDDDD SSSSSSSSS WAITPNE D,S WC 'wait for (pins & S) <> D, do timeout ?
----------------------------------------------------------------------------------------------------
* 1 + number of other instructions in the pipeline (0..3) which belong to the executing task

BRANCHES
--------

As elaborated on in the pipeline section, there are both normal and delayed branching instructions.
The normal branching instructions cancel any same-task instructions which are in the pipeline, causing
the next instruction that executes in that task to be from the address that was branched to. The delayed
branching instructions, intended only for single-task programs, do not cancel any pipelined instructions,
allowing the three trailing instructions in the pipeline to execute before the branch appears to take
effect. The advantage in using delayed branches is that they only take one clock, but careful programming
is required to accommodate the three trailing instructions:

loop MOV X,#100 'toggle P0/P1/P2 100 times, then toggle P3

loop2 DJNZD X,#loop2 'loop, delayed branch executes 3 trailing instructions
NOTP #0 'toggle P0
NOTP #1 'toggle P1
NOTP #2 'toggle P2

NOTP #3 'now toggle P3
JMP #loop 'do it again

In the branch instruction definitions below, only normal branches are shown, though any of them can be
made into delayed branches by adding a 'D' to their mnemonic (e.g. JMP becomes JMPD).

The JMP (jump), CALL, and RET (return) instructions are specific cases of the JMPRET instruction. CALL
works by simultaneously jumping to a labeled subroutine and storing the return address (the address after
the CALL) into a RET instruction that has the same label as the subroutine, but with '_RET' at the end:

loop CALL #sub1 'call to sub1, store next address into bits 8..0 of sub1_ret
CALL #sub2 'call to sub2, store next address into bits 8..0 of sub2_ret
JMP #loop 'loop back to first call

sub1 NOTP #0 'start of sub1 routine
sub1_ret RET 'return to caller (actually JMP #returnaddress)

sub2 NOTP #1 'start of sub2 routine
sub2_ret RET 'return to caller (actually JMP #returnaddress)

Because the return address is stored in an actual instruction at the end of the subroutine, these kinds
of calls cannot be recursive, unlike the stack RAM-based calls and returns which are elaborated on in the
STACK RAM section.

The WZ and WC suffixes can be used with CALL/RET instructions to control flag updating. For example,
if you wish to call a subroutine and preserve the Z and/or C flags, you can add the WZ and/or WC suffixes
to both the CALL and RET instructions to cause the flags to be initially saved on CALL and subsequently
restored on RET:

loop CMP a,b WZ,WC 'compare a to b, affect Z and C
CALL #sub WZ,WC 'call to sub and save Z/C/PC into bits 10..0 of the RET
IF_C_OR_Z JMP #loop 'loop if a =< b
JMP #else 'else, branch

sub GETP #0 WC 'get pin 0 into C (mess up C and Z)
GETNP #1 WZ 'get pin 1 into Z
SETPC #6 'set pin 6 to C
SETPZ #7 'set pin 7 to Z
sub_ret RET WZ,WC 'return to caller, restore Z/C/PC from bits 10..0 in RET

Here are the discrete JMP/CALL/RET instructions and the general-case JMPRET instruction:

JMP S - Jump to address in S[8..0]
If WC then C = S[9]
If WZ then Z = S[10]

JMP #n - Jump to immediate 0..511
If WC then C = bit 9 of JMP instruction (in unused D field)
If WZ then Z = bit 10 of JMP instruction (in unused D field)

CALL #label - Jump to label which begins subroutine
The assembler points the D field to the RET at label_RET
PC+1 is written to D[8..0] (PC+4 for CALLD)
If WC then C is written to D[9]
If WZ then Z is written to D[10]
D[31..11], plus D[10]/D[9] per WZ/WC, are preserved

RET - Jump to bits 8..0 of RET instruction (assembled as JMP #0)
If WC then C = bit 9 of RET instruction (in unused D field)
If WZ then Z = bit 10 of RET instruction (in unused D field)

JMPRET D,#n NR - Jump to immediate 0..511 (same as 'JMP #n' and 'RET')
If WC then C = bit 9 of JMPRET instruction (in D field)
If WZ then Z = bit 10 of JMPRET instruction (in D field)

JMPRET D,S NR - Jump to address in S[8..0] (same as 'JMP S')
If WC then C = S[9]
If WZ then Z = S[10]

JMPRET D,#n - Jump to immediate 0..511 (same as 'CALL #label')
PC+1 is written to D[8..0] (PC+4 for JMPRETD)
If WC then C is written to D[9], else D[9] same
If WZ then Z is written to D[10], else D[10] same
D[31..11] are preserved

JMPRET D,S - Jump to address in S[8..0]
PC+1 is written to D[8..0] (PC+4 for JMPRETD)
If WC then C is written to D[9] and reloaded from S[9]
If WZ then Z is written to D[10] and reloaded from S[10]
D[31..11], and D[10]/D[9] per WZ/WC, are preserved

TASKSW - Short for 'JMPRET INDA,++INDA WZ,WC'
For round-robin switching among threaded tasks
Use FIXINDA to set up a ring of Z/C/PC registers
Use with register remapping for multiple program instances
Instructions trailing TASKSWD are in the next thread
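
The Z/C/PC packing used by these instructions can be illustrated on a host machine. The Python sketch
below models only the register write described for 'JMPRET D,#n' (and therefore CALL): PC+1 goes into
D[8..0], C and Z go into D[9] and D[10] when WC/WZ are given, and all other bits are preserved. The
function names are hypothetical.

```python
def pack_zcpc(z, c, pc):
    """Pack flags and PC as bit 10 = Z, bit 9 = C, bits 8..0 = PC."""
    return ((z & 1) << 10) | ((c & 1) << 9) | (pc & 0x1FF)


def unpack_zcpc(value):
    """Return (Z, C, PC) from the low 11 bits of a long."""
    return (value >> 10) & 1, (value >> 9) & 1, value & 0x1FF


def call_writeback(d_old, pc, z, c, wz=False, wc=False):
    """Model the D write of a normal (non-delayed) 'JMPRET D,#n'."""
    result = (d_old & ~0x1FF) | ((pc + 1) & 0x1FF)      # return address = PC+1
    if wc:
        result = (result & ~(1 << 9)) | ((c & 1) << 9)   # save C only with WC
    if wz:
        result = (result & ~(1 << 10)) | ((z & 1) << 10) # save Z only with WZ
    return result & 0xFFFFFFFF


print(hex(call_writeback(0xFFFFFFFF, 0x010, z=0, c=1, wz=True, wc=True)))
print(unpack_zcpc(pack_zcpc(1, 0, 0x123)))
```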

instructions clocks
-------------------------------------------------------------------------------------------------
000111 ZC0 0 CCCC 000000000 SSSSSSSSS JMP S 'jump to S 4 *
000111 ZC0 1 CCCC 000000000 nnnnnnnnn JMP #n 'jump to 0..511 4 *
000111 ZC0 1 CCCC 000000000 000000000 RET 'return from subroutine 4 *
000111 ZC1 1 CCCC DDDDDDDDD LLLLLLLLL CALL #label 'call subroutine 4 *
000111 ZCR 0 CCCC DDDDDDDDD SSSSSSSSS JMPRET D,S 'jump to S, store return in D 4 *
000111 ZCR 1 CCCC DDDDDDDDD nnnnnnnnn JMPRET D,#n 'jump to 0..511, store return in D 4 *
000111 111 0 0011 111110110 111110110 TASKSW 'JMPRET INDA,++INDA WZ,WC 4 *

010111 ZC0 0 CCCC 000000000 SSSSSSSSS JMPD S 'jump to S 1
010111 ZC0 1 CCCC 000000000 nnnnnnnnn JMPD #n 'jump to 0..511 1
010111 ZC0 1 CCCC 000000000 000000000 RETD 'return from subroutine 1
010111 ZC1 1 CCCC DDDDDDDDD LLLLLLLLL CALLD #label 'call subroutine 1
010111 ZCR 0 CCCC DDDDDDDDD SSSSSSSSS JMPRETD D,S 'jump to S, store return in D 1
010111 ZCR 1 CCCC DDDDDDDDD nnnnnnnnn JMPRETD D,#n 'jump to 0..511, store return in D 1
010111 111 0 0011 111110110 111110110 TASKSWD 'JMPRETD INDA,++INDA WZ,WC 1
-------------------------------------------------------------------------------------------------
* 4 clocks for single-task, actual count is 1 + number of same-task instructions in pipeline

Here are the conditional branches:

IJZ D,S/#n - Increment D and Jump to S/#n if result is zero
IJNZ D,S/#n - Increment D and Jump to S/#n if result is not zero
DJZ D,S/#n - Decrement D and Jump to S/#n if result is zero
DJNZ D,S/#n - Decrement D and Jump to S/#n if result is not zero
TJZ D,S/#n - Jump to S/#n if D is zero
TJNZ D,S/#n - Jump to S/#n if D is not zero
JP D,S/#n - Jump to S/#n if pin D reads high
JNP D,S/#n - Jump to S/#n if pin D reads low

instructions clocks
-------------------------------------------------------------------------------------------------
111100 00R I CCCC DDDDDDDDD SSSSSSSSS IJZ D,S 'increment D and jump if zero 4 *
111100 10R I CCCC DDDDDDDDD SSSSSSSSS IJNZ D,S 'increment D and jump if not zero 4 *
111101 00R I CCCC DDDDDDDDD SSSSSSSSS DJZ D,S 'decrement D and jump if zero 4 *
111101 10R I CCCC DDDDDDDDD SSSSSSSSS DJNZ D,S 'decrement D and jump if not zero 4 *
111110 000 I CCCC DDDDDDDDD SSSSSSSSS TJZ D,S 'test D and jump if zero 4 *
111110 100 I CCCC DDDDDDDDD SSSSSSSSS TJNZ D,S 'test D and jump if not zero 4 *
111110 001 I CCCC DDDDDDDDD SSSSSSSSS JP D,S 'jump if pin D high 4 *
111110 101 I CCCC DDDDDDDDD SSSSSSSSS JNP D,S 'jump if pin D low 4 *

111100 01R I CCCC DDDDDDDDD SSSSSSSSS IJZD D,S 'increment D and jump if zero 1
111100 11R I CCCC DDDDDDDDD SSSSSSSSS IJNZD D,S 'increment D and jump if not zero 1
111101 01R I CCCC DDDDDDDDD SSSSSSSSS DJZD D,S 'decrement D and jump if zero 1
111101 11R I CCCC DDDDDDDDD SSSSSSSSS DJNZD D,S 'decrement D and jump if not zero 1
111110 010 I CCCC DDDDDDDDD SSSSSSSSS TJZD D,S 'test D and jump if zero 1
111110 110 I CCCC DDDDDDDDD SSSSSSSSS TJNZD D,S 'test D and jump if not zero 1
111110 011 I CCCC DDDDDDDDD SSSSSSSSS JPD D,S 'jump if pin D high 1
111110 111 I CCCC DDDDDDDDD SSSSSSSSS JNPD D,S 'jump if pin D low 1
-------------------------------------------------------------------------------------------------
* 4 clocks for single-task, actual count is 1 + number of same-task instructions in pipeline

COUNTERS - this section is not done yet!!!
--------

Each cog has two configurable counters. They are named CTRA and CTRB and are accessed by
thirteen instructions each. The instructions which end in "A" are for CTRA and those that
end in "B" are for CTRB. For brevity, only CTRA instructions are used in the definitions and
examples that follow.

GETPHSA D - Get PHSA into D
GETPHZA D - Get PHSA into D, simultaneously clear PHSA to 0
GETCOSA D - Get COSA into D
GETSINA D - Get SINA into D

SETCTRA D/#n - Set CTRA configuration
SETWAVA D/#n - Set WAVA
SETFRQA D/#n - Set FRQA
SETPHSA D/#n - Set PHSA
ADDPHSA D/#n - Add to PHSA
SUBPHSA D/#n - Subtract from PHSA

SYNCTRA - Wait for PHSA to roll over
POLCTRA WC - Check if PHSA has rolled over (C=1 if rolled over)
CAPCTRA - Capture CTRA accumulators into COSA and SINA

Modes:

(QDR = PHS[31] XNOR PHS[30], or PHS[31] delayed by 90 degrees)

Off Mode
-------------------------------------------------------------------------------
%00000 = Counter off (initial state after cog start)

NCO Modes
-------------------------------------------------------------------------------
%00001 = NCO output + video PLL mode, PLL output = PHS[31] (reference signal)
%00010 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 32
%00011 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 16
%00100 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 8
%00101 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 4
%00110 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 2
%00111 = NCO output + video PLL mode, PLL output = PHS[31] times 8 divide by 1
%01000 = NCO output

DUAL Modes
-------------------------------------------------------------------------------
%000_01001 = dual NCO outputs + dual COUNT_LOWS inputs
%001_01001 = dual NCO outputs + dual COUNT_HIGHS inputs
%010_01001 = dual NCO outputs + dual COUNT_NEGATIVE_EDGES inputs
%011_01001 = dual NCO outputs + dual COUNT_POSITIVE_EDGES inputs
%100_01001 = dual NCO outputs + dual TIME_LOWS inputs
%101_01001 = dual NCO outputs + dual TIME_HIGHS inputs
%110_01001 = dual NCO outputs + dual TIME_NEGATIVE_EDGES inputs
%111_01001 = dual NCO outputs + dual TIME_POSITIVE_EDGES inputs

%000_01010 = dual DUTY outputs + dual COUNT_LOWS inputs
%001_01010 = dual DUTY outputs + dual COUNT_HIGHS inputs
%010_01010 = dual DUTY outputs + dual COUNT_NEGATIVE_EDGES inputs
%011_01010 = dual DUTY outputs + dual COUNT_POSITIVE_EDGES inputs
%100_01010 = dual DUTY outputs + dual TIME_LOWS inputs
%101_01010 = dual DUTY outputs + dual TIME_HIGHS inputs
%110_01010 = dual DUTY outputs + dual TIME_NEGATIVE_EDGES inputs
%111_01010 = dual DUTY outputs + dual TIME_POSITIVE_EDGES inputs

%000_01011 = dual PWM outputs + dual COUNT_LOWS inputs
%001_01011 = dual PWM outputs + dual COUNT_HIGHS inputs
%010_01011 = dual PWM outputs + dual COUNT_NEGATIVE_EDGES inputs
%011_01011 = dual PWM outputs + dual COUNT_POSITIVE_EDGES inputs
%100_01011 = dual PWM outputs + dual TIME_LOWS inputs
%101_01011 = dual PWM outputs + dual TIME_HIGHS inputs
%110_01011 = dual PWM outputs + dual TIME_NEGATIVE_EDGES inputs
%111_01011 = dual PWM outputs + dual TIME_POSITIVE_EDGES inputs

WAVE Modes
-------------------------------------------------------------------------------
%01100 = dual SQR_WAVE output + GOERTZEL input
%01101 = dual SAW_WAVE output + GOERTZEL input
%01110 = dual TRI_WAVE output + GOERTZEL input
%01111 = dual SIN_WAVE output + GOERTZEL input

In the WAVE modes, FRQ is added into PHS on every clock cycle. The top nine bits of PHS
are used to drive sine and cosine lookup tables which are used for sine output functions
and GOERTZEL computations. While the sine/cosine output functions are the most useful for
signal processing, triangle-, sawtooth-, and square-wave output functions are also selectable,
being derived from the top nine bits of PHS, as well.

The WAVE modes output both parallel DAC signals and duty-modulated pin signals. All
output signals are nine bits in base quality with an additional nine sub-bits of dithering
to maintain base quality after attenuative scaling. The dual outputs differ only in phase
and are set up by the WAV register:

WAV register in WAVE modes (can be changed by SETWAVA/SETWAVB instruction)
-------------------------------------------------------------------------------
%PPPPPPPPP_xxxxx_TTTTTTTTT_AAAAAAAAA

PPPPPPPPP = phase advance for OUTA (0 to 511/512 revolutions)
xxxxx = unused for WAVE modes
TTTTTTTTT = offset for OUTA and OUTB
AAAAAAAAA = amplitude for OUTA and OUTB

Initial value after cog start:

%010000000_00000_100000000_111111111

010000000 = 90-degree phase advance for GOERTZEL use (OUTA=cosine, OUTB=sine)
00000 = unused
100000000 = mid-point offset (allows maximum amplitude)
111111111 = maximum amplitude

The GOERTZEL computation works as follows, on every clock:

Nine-bit sine and cosine values are looked up using the top nine bits of PHS.
The sine and cosine values are negated if INA is 0, else they remain the same.
The sine and cosine values are added into separate sine and cosine accumulators.

This process measures the energy content of INA at the frequency of PHS rollover.
To make this work, the INA pin should be configured for delta-sigma ADC mode, so
that it streams back 1's and 0's that ratiometrically represent the voltage of the
I/O pin.

To make a GOERTZEL measurement:

- The top nine bits of WAV should be set to %010000000 for proper cosine lookup.
- FRQ must be set to generate the frequency of interest in PHS rollovers (SETFRQA).
- PHS and the accumulators should be cleared to 0 (SETPHSA #0, then CAPCTRA).
- Some number of complete PHS rollovers must be waited for (SYNCTRA/POLLCTRA).
- The accumulators must be captured and read (CAPCTRA + GETCOSA + GETSINA).
- The length of the accumulators' hypotenuse indicates signal strength, and its angle indicates phase.

By making swept FRQ measurements in a closed loop, where OUTA is used to output a reference
frequency of known phase to stimulate a system, and INA receives a signal back that
is somehow coupled to OUTA, you can determine things such as spectral response, resonant
frequency, and frequency vs. phase of a system.

The more PHS rollovers in a measurement, the more selective the result will be. For open-
loop measurements, this means tighter bandwidth. For closed-loop measurements, the angle
of the hypotenuse becomes meaningful. The QARCTAN instruction can translate the sine and
cosine accumulations into power and phase values.
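
The per-clock GOERTZEL steps above can be modeled in software. The following Python sketch is only an illustration under stated assumptions: the contents and scaling of the hardware lookup table are not given here, so a 512-entry sine table scaled to +/-255 is assumed, and the names goertzel, square_bits, and strength_and_phase are hypothetical.

```python
import math

# Assumed 512-entry signed sine table; the real table contents and scaling are not documented here.
SINE = [round(255 * math.sin(2 * math.pi * i / 512)) for i in range(512)]

def goertzel(bits, frq, phase_advance=0b010000000):
    """Accumulate (cosine, sine) sums over a stream of ADC bits, one bit per clock."""
    phs = cos_acc = sin_acc = 0
    for bit in bits:
        idx = phs >> 23                         # top nine bits of the 32-bit PHS
        s = SINE[idx]
        c = SINE[(idx + phase_advance) & 511]   # 90-degree advance yields cosine
        if bit == 0:                            # negate both values when INA is 0
            s, c = -s, -c
        sin_acc += s
        cos_acc += c
        phs = (phs + frq) & 0xFFFFFFFF          # FRQ is added into PHS on every clock
    return cos_acc, sin_acc

def square_bits(frq, clocks):
    """Hypothetical test input: 1 during the first half of each PHS rollover, else 0."""
    return [0 if ((n * frq) & 0xFFFFFFFF) >> 31 else 1 for n in range(clocks)]

def strength_and_phase(cos_acc, sin_acc):
    """Hypotenuse length and angle in degrees of the two accumulators."""
    return math.hypot(cos_acc, sin_acc), math.degrees(math.atan2(sin_acc, cos_acc))
```

Feeding the model a bit stream that toggles at the reference frequency gives a large hypotenuse, while the same stream measured against twice that frequency gives almost none, which is the selectivity described above.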

LOGIC Modes
-------------------------------------------------------------------------------
%10000 = LOGIC_A_POSEDGE input INA & !INA previous
%10001 = LOGIC_NA_AND_NB input !INA & !INB
%10010 = LOGIC_A_AND_NB input INA & !INB
%10011 = LOGIC_NB input !INB
%10100 = LOGIC_NA_AND_B input !INA & INB
%10101 = LOGIC_NA input !INA
%10110 = LOGIC_A_NE_B input INA <> INB
%10111 = LOGIC_NA_OR_NB input !INA | !INB
%11000 = LOGIC_A_AND_B input INA & INB
%11001 = LOGIC_A_EQ_B input INA == INB
%11010 = LOGIC_A input INA
%11011 = LOGIC_A_OR_NB input INA | !INB
%11100 = LOGIC_B input INB
%11101 = LOGIC_NA_OR_B input !INA | INB
%11110 = LOGIC_A_OR_B input INA | INB
%11111 = LOGIC_ENCODER input INA, INB encoder

OUTA = ADD signal (condition met or LOGIC_ENCODER forward step)
OUTB = SUB signal (LOGIC_ENCODER reverse step)

In the LOGIC modes, FRQ is conditionally added to PHS on each clock cycle that meets that
mode's requirement. In the case of the LOGIC_ENCODER mode, FRQ may be added to or subtracted
from PHS when a half-step is registered. OUTA and OUTB reflect the ADD and SUB states
for each cycle, and are more likely to be useful to other CTRs than as signals sent to
output pins.
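
As a rough software model of the conditional accumulation, the Python sketch below treats the low four mode bits as a truth table indexed by (INB, INA). That indexing is an observation that happens to agree with every row of the table above from %10001 to %11110; it is not a documented encoding. The function names are hypothetical, and the edge and encoder modes are left out.

```python
def logic_condition(mode, ina, inb):
    """Return 1 if the LOGIC mode's condition is met for this clock's INA/INB levels."""
    if not 0b10001 <= mode <= 0b11110:
        raise ValueError("only the level-based LOGIC modes are modeled")
    # Bit (2*INB + INA) of the low four mode bits is the condition's truth-table entry.
    return (mode >> (2 * inb + ina)) & 1

def run_logic(mode, samples, frq, phs=0):
    """Add FRQ to the 32-bit PHS on every clock whose (INA, INB) sample meets the condition."""
    for ina, inb in samples:
        if logic_condition(mode, ina, inb):
            phs = (phs + frq) & 0xFFFFFFFF
    return phs
```

With FRQ set to 1, PHS simply counts the clocks on which the condition was true.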

DACS
----

Each cog outputs 4 channels of DAC data, named DAC0..DAC3. These DAC data channels can be
set to values in software or actively driven from CTRA/CTRB or VID. In all cases but VID,
the source data is 18 bits and is dithered on every clock cycle for 9-bit DAC output. In
the case of VID, the source data is just 9 bits, so no dithering is performed.

Each I/O pin has a 75-ohm 9-bit DAC which can be configured using CFGPINS to output a
fixed DACx channel from any cog. Every cog's DAC0..DAC3 are available, in that sequence,
to P0..P3, then to the next four pins, and so on, as shown below:

PortA PortB PortC DACx
--------------------------------
P0 P32 P64 DAC0
P1 P33 P65 DAC1
P2 P34 P66 DAC2
P3 P35 P67 DAC3
P4 P36 P68 DAC0
P5 P37 P69 DAC1
P6 P38 P70 DAC2
P7 P39 P71 DAC3
P8 P40 P72 DAC0
P9 P41 P73 DAC1
P10 P42 P74 DAC2
P11 P43 P75 DAC3
P12 P44 P76 DAC0
P13 P45 P77 DAC1
P14 P46 P78 DAC2
P15 P47 P79 DAC3
P16 P48 P80 DAC0
P17 P49 P81 DAC1
P18 P50 P82 DAC2
P19 P51 P83 DAC3
P20 P52 P84 DAC0
P21 P53 P85 DAC1
P22 P54 P86 DAC2
P23 P55 P87 DAC3
P24 P56 P88 DAC0
P25 P57 P89 DAC1
P26 P58 P90 DAC2
P27 P59 P91 DAC3
P28 P60 P92 DAC0
P29 P61 P93 DAC1
P30 P62 P94 DAC2
P31 P63 P95 DAC3
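
The table follows a simple pattern: the DAC channel is the pin number modulo 4, and the port changes every 32 pins. A small Python sketch of that pattern, with hypothetical function names:

```python
def dac_channel_for_pin(pin):
    """Return the DACx channel index (0..3) available to pin P0..P95."""
    if not 0 <= pin <= 95:
        raise ValueError("pin must be in 0..95")
    return pin % 4

def port_for_pin(pin):
    """Return the port letter (A, B, or C) that the pin belongs to."""
    if not 0 <= pin <= 95:
        raise ValueError("pin must be in 0..95")
    return "ABC"[pin // 32]
```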

Here are the instructions which configure DAC0..DAC3:

CFGDAC0 D/#n - Configure DAC0

%00 = Software controlled (default)
%01 = CTRA SIGA
%10 = CTRA SIGA + CTRB SIGA
%11 = VID SIG0

CFGDAC1 D/#n - Configure DAC1

%00 = Software controlled (default)
%01 = CTRA SIGB
%10 = CTRA SIGB + CTRB SIGB
%11 = VID SIG1

CFGDAC2 D/#n - Configure DAC2

%00 = Software controlled (default)
%01 = CTRB SIGA
%10 = CTRA SIGA + CTRB SIGA
%11 = VID SIG2

CFGDAC3 D/#n - Configure DAC3

%00 = Software controlled (default)
%01 = CTRB SIGB
%10 = CTRA SIGB + CTRB SIGB
%11 = VID SIG3

CFGDACS D/#n - Configure DAC3..DAC0 from four 2-bit fields: %33_22_11_00
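
For illustration, the four 2-bit fields of the CFGDACS operand can be assembled as in the Python sketch below. The constant and function names are hypothetical; only the %33_22_11_00 field layout comes from the text above.

```python
# Field values shared by all four DACs; what %01 and %10 select differs per DAC (see lists above).
SOFTWARE, CTR_SIGNAL, CTR_SUM, VIDEO = 0b00, 0b01, 0b10, 0b11

def cfgdacs_operand(dac0, dac1, dac2, dac3):
    """Pack four 2-bit DAC configurations into %33_22_11_00 order."""
    for cfg in (dac0, dac1, dac2, dac3):
        if not 0 <= cfg <= 3:
            raise ValueError("each configuration is a 2-bit value")
    return dac3 << 6 | dac2 << 4 | dac1 << 2 | dac0
```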

For configurations %00..%10, the data sources are 18 bits wide, with the 9 lower bits
being dithered by a 32-bit LFSR to realize more DAC resolution. This improves dynamic
range, but introduces a white noise of one step in amplitude in the 9-bit DAC output.
As dynamic signals get smaller in amplitude, they appear to sink into the dither noise,
but actually remain very high-Q, as the dither noise is very low-Q. For configuration
%11 (VID), the data is a straight 9 bits with no dithering, as pixels could only be
dithered once per frame, resulting only in visible luminance noise, which is not
desirable.

The dithering works by taking nine fixed bits from a 32-bit LFSR and sign-extending
them to 18 bits. This yields a pseudo-random value ranging from %111111111_100000000
(negative) to %000000000_011111111 (positive) on every clock cycle. When added to the
18-bit source data, the lower 9 bits of source data are realized as a proportional
toggling between two adjacent values in the top 9 bits of the sum, which form the DAC
output data. It will take at least 512 (2^9) clocks for the DAC output to average to
the intended 18-bit source value, assuming source data is static.

On cog start, all configurations are cleared to %00 and the source values are set to
%000000000_100000000, which is effectively zero, since dithering will never cause an
output step toggle when the nine lower source bits are %100000000:

source data %XXXXXXXXX_100000000
+ minimum dither %111111111_100000000
--------------------
= %XXXXXXXXX_000000000 (top 9 bits are unchanged)

source data %XXXXXXXXX_100000000
+ maximum dither %000000000_011111111
--------------------
= %XXXXXXXXX_111111111 (top 9 bits are unchanged)
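
The dither arithmetic can be checked with a short Python model. This is a sketch only: the LFSR taps are not given here, so the model sweeps all 512 possible dither values instead of generating a pseudo-random sequence, and the function names are hypothetical.

```python
def dac_output(source18, dither9):
    """Top nine bits of the 18-bit sum of source data and sign-extended 9-bit dither."""
    if not -256 <= dither9 <= 255:
        raise ValueError("dither is a signed 9-bit value")
    return ((source18 + dither9) & 0x3FFFF) >> 9

def average_output(source18):
    """Mean DAC code over every possible dither value (stands in for the LFSR)."""
    return sum(dac_output(source18, d) for d in range(-256, 256)) / 512
```

With the lower nine bits at %100000000 the output never toggles. Other values average to a fraction between two adjacent steps, and a source of 0 wraps to full scale on half of the clocks.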

Here are the instructions which set DAC0..DAC3 source values in software:

SETDAC0 #n - Set DAC0 to %nnnnnnnnn_100000000, force configuration to %00
SETDAC0 D - Set DAC0 to D[31..14], force configuration to %00 *

SETDAC1 #n - Set DAC1 to %nnnnnnnnn_100000000, force configuration to %00
SETDAC1 D - Set DAC1 to D[31..14], force configuration to %00 *

SETDAC2 #n - Set DAC2 to %nnnnnnnnn_100000000, force configuration to %00
SETDAC2 D - Set DAC2 to D[31..14], force configuration to %00 *

SETDAC3 #n - Set DAC3 to %nnnnnnnnn_100000000, force configuration to %00
SETDAC3 D - Set DAC3 to D[31..14], force configuration to %00 *

SETDACS #n - Set DAC3..DAC0 to %nnnnnnnnn_100000000
Force DAC3..DAC0 configurations to %00

SETDACS D - Set DAC3 to %dddddddd0_100000000, where dddddddd is D[31..24]
Set DAC2 to %dddddddd0_100000000, where dddddddd is D[23..16]
Set DAC1 to %dddddddd0_100000000, where dddddddd is D[15..8]
Set DAC0 to %dddddddd0_100000000, where dddddddd is D[7..0]
Force DAC3..DAC0 configurations to %00

* Be aware when using SETDACx D, that if D < $00400000 or D > $FFC03FFF, full-
scale toggling will occur, as the dither addition will cause wrapping. For
ground-based DAC output, you can add $00400000 to each output sample to
prevent this from happening.

VIDEO
-----

Each cog has a video generator (VID) that can stream pixel data and perform colorspace
conversion and modulation, so that final video signals can be output to the 75-ohm DACs
on the I/O pins.

Pixel streaming, colorspace conversion, modulation, DAC channel driving, and DAC pin
updating are all performed in a pipelined fashion on each cycle of VID's dot clock.

VID gets its dot clock from CTRA's PLL. So, CTRA must be configured for PLL operation in
order for VID to operate.

The DACx channels must be configured for video output by using CFGDACx. To set all DACx
channels to video, do 'CFGDACS #%11_11_11_11'.

The I/O pins which will output the DACx channels must be configured to do so via CFGPINS.

To turn on VID and configure its DAC channel outputs, the SETVID instruction is used:

SETVID D/#n - Set video configuration register (VCFG)

%00xx = off (default) SIG3 SIG2 SIG1 SIG0
----------------------------
%01xx = SDTV/HDTV/VGA Y_R I_G Q_B SYN
%10xx = NTSC/PAL S-VIDEO YIQ YIQ _IQ Y__
%11xx = NTSC/PAL COMPOSITE YIQ YIQ YIQ YIQ

%xx0x = zero-extend Y/I/Q coefficients for VGA colorspace (allows +$80, or '1.0')
%xx1x = sign-extend Y/I/Q coefficients for NTSC/PAL/SDTV/HDTV colorspace

%xxx0 = positive VGA sync on SYN / positive modulation phase
%xxx1 = negative VGA sync on SYN / negative modulation phase (used in PAL video)

Before any meaningful video signals can be output, you must set the colorspace coefficients
and offset levels, which are each 8 bits:

SETVIDY D/#n - Set Y_R offset level and RGB colorspace coefficients: $YO_YR_YG_YB

SETVIDI D/#n - Set I_G offset level and RGB colorspace coefficients: $IO_IR_IG_IB

SETVIDQ D/#n - Set Q_B offset level and RGB colorspace coefficients: $QO_QR_QG_QB

All pixels are internally handled by VID as 2:8:8:8 bit SYNC:R:G:B data.

Colorspace conversion is performed as sum-of-products calculations on the R:G:B pixel data
and the colorspace coefficients, yielding Y, I, and Q components:

Where R, G, B are 8-bit pixel color components and Y, I, Q are 9-bit sums (MOD 512):

Y = R*YR/64 + G*YG/64 + B*YB/64 Where YR, YG, YB are 8-bit Y coefficients
I = R*IR/64 + G*IG/64 + B*IB/64 Where IR, IG, IB are 8-bit I coefficients
Q = R*QR/64 + G*QG/64 + B*QB/64 Where QR, QG, QB are 8-bit Q coefficients

For outputs Y_R, I_G, and Q_B, offset levels are added to the Y, I, and Q components to
properly position the final signals for SDTV/HDTV. In the case of VGA outputs, the
offset levels are set to 0, since they are ground-based.
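
The sum-of-products step can be sketched in Python as follows. Two details are assumptions, since they are not specified above: the three products are summed before the single divide by 64, and the divide rounds toward negative infinity. The function names are hypothetical.

```python
def extend(coeff8, signed):
    """Interpret an 8-bit coefficient as sign-extended (VCFG[1]=1) or zero-extended (VCFG[1]=0)."""
    coeff8 &= 0xFF
    return coeff8 - 256 if signed and coeff8 & 0x80 else coeff8

def component(r, g, b, coeffs, signed):
    """Compute one 9-bit Y, I, or Q component (MOD 512) from 8-bit R, G, B values."""
    cr, cg, cb = (extend(c, signed) for c in coeffs)
    # Assumption: one divide after summing; hardware truncation behavior is not documented here.
    return ((r * cr + g * cg + b * cb) // 64) % 512
```

With a zero-extended coefficient of $80 ('1.0'), an 8-bit color value maps to twice its value in the 9-bit result.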

For modulated outputs YIQ and _IQ, the I and Q components, treated as (I,Q), are rotated
around (0,0) by an angle that steps 1/16th of a revolution on each dot clock, yielding
Q'. In the case of YIQ output, the Y component (luma) and Q' (chroma) are added to form
a composite video signal. In the case of _IQ output, an offset level is added to Q' to
form an s-video chroma signal. For Y__ output, the Y component (luma) is output alone to
form an s-video luma signal.

For sync 'pixels', bit 24 or 25 is set in the pixel word and various formulas are used for
generating the different output signals. When fewer than 32 bits are expressed per pixel, the
SYNC bits will be %00.

DAC channel outputs per pixel data input (outputs are 9 bits each, MOD 512)
------------------------------------------------------------------------------------
Y_R %x0_RRRRRRRR_GGGGGGGG_BBBBBBBB = YO*2 + Y component/vga pixel
%x1_0xxxxxxx_xxxxxxxx_xxxxxxxx = YO*2 component/vga black
%x1_1xxxxxxx_xxxxxxxx_SSSSSSSS = YO*2 + SSSSSSSS*2 component sync

I_G %x0_RRRRRRRR_GGGGGGGG_BBBBBBBB = IO*2 + I component/vga pixel
%x1_x0xxxxxx_xxxxxxxx_xxxxxxxx = IO*2 component/vga black
%x1_x1xxxxxx_xxxxxxxx_SSSSSSSS = IO*2 + SSSSSSSS*2 component sync

Q_B %x0_RRRRRRRR_GGGGGGGG_BBBBBBBB = QO*2 + Q component/vga pixel
%x1_xx0xxxxx_xxxxxxxx_xxxxxxxx = QO*2 component/vga black
%x1_xx1xxxxx_xxxxxxxx_SSSSSSSS = QO*2 + SSSSSSSS*2 component sync

SYN %x0_xxxxxxxx_xxxxxxxx_xxxxxxxx = VCFG[0]*511 vga sync unasserted
%x1_xxxxxxxx_xxxxxxxx_xxxxxxxx = !VCFG[0]*511 vga sync asserted

Y__ %00_RRRRRRRR_GGGGGGGG_BBBBBBBB = YO*2 + Y s-video luma pixel
%01_xxxxxxxx_xxxxxxxx_xxxxxxxx = IO*2 s-video luma sync high
%1x_xxxxxxxx_xxxxxxxx_xxxxxxxx = 0 s-video luma sync low

_IQ %xx_xxxxxxxx_xxxxxxxx_xxxxxxxx = QO*2 + Q' s-video chroma

YIQ %00_RRRRRRRR_GGGGGGGG_BBBBBBBB = YO*2 + Y + Q' composite pixel
%01_xxxxxxxx_xxxxxxxx_xxxxxxxx = IO*2 + Q' composite sync high
%1x_xxxxxxxx_xxxxxxxx_xxxxxxxx = Q' composite sync low

Below are some common colorspace coefficient sets. Note that these values are normalized
to 1. In the sum-of-products calculations, 128 is equal to 1, so the values below should all
be multiplied by 128 to get the proper 8-bit values for usage as coefficients. In practice,
the values will need to be scaled down so that under 75-ohm load, they will peak at 1.0V or
0.7V (not 1.65V, which is 3.3V/2). This scaling will compromise DAC span by ~39%..~58%,
leaving you with a still-sufficient ~8 bits of DAC resolution. However, if you'd like to
keep DAC span maximal, you may leave the coefficients as originally computed and achieve
the proper voltage under load by using external resistors, being sure to maintain 75 ohms
source impedance.

coefficient positions
-----------------------
YR YG YB
IR IG IB
QR QG QB
-----------------------

RGB (VGA) VCFG[1]=0
-----------------------
1 0 0 R sums to 1
0 1 0 G sums to 1
0 0 1 B sums to 1
-----------------------

YPbPr (HDTV) VCFG[1]=1 x128
----------------------- -------------
+.213 +.715 +.072 Y sums to 1 +27 +92 +9
-.115 -.385 +.500 Pb sums to 0 -15 -49 +64
+.500 -.454 -.046 Pr sums to 0 +64 -58 -6
-----------------------

YPbPr (SDTV) VCFG[1]=1
-----------------------
+.299 +.587 +.114 Y sums to 1
-.169 -.331 +.500 Pb sums to 0
+.500 -.419 -.081 Pr sums to 0
-----------------------

YIQ (NTSC) VCFG[1]=1
-----------------------
+.299 +.587 +.114 Y sums to 1
+.596 -.274 -.322 I sums to 0 *
+.212 -.523 +.311 Q sums to 0 *
-----------------------

YUV (PAL) VCFG[1]=1
-----------------------
+.299 +.587 +.114 Y sums to 1
-.147 -.289 +.436 U sums to 0 *
+.615 -.515 -.100 V sums to 0 *
-----------------------

* The three coefficients in each row marked with * must be scaled by 0.608 to pre-compensate
for the CORDIC rotator expansion which will occur in the video modulator.
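
As an illustration of turning the normalized values into 8-bit coefficients, the Python sketch below multiplies by 128, rounds, and packs the result in the $xO_xR_xG_xB operand layout used by SETVIDY/SETVIDI/SETVIDQ. The function names are hypothetical, and round-to-nearest is an assumption that happens to reproduce the x128 column shown for HDTV.

```python
HDTV = [(+.213, +.715, +.072),   # Y
        (-.115, -.385, +.500),   # Pb
        (+.500, -.454, -.046)]   # Pr

def quantize(row, scale=1.0):
    """Convert normalized coefficients to integers where 128 equals 1.0."""
    return tuple(round(x * 128 * scale) for x in row)

def pack_setvid(offset, row):
    """Pack an 8-bit offset and three quantized coefficients as $OO_RR_GG_BB."""
    r, g, b = (c & 0xFF for c in row)   # two's complement bytes for negative values
    return (offset & 0xFF) << 24 | r << 16 | g << 8 | b
```

For rows marked with *, the 0.608 factor can be passed as scale, for example quantize((+.596, -.274, -.322), 0.608).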

Once VID is configured, WAITVID instructions are used to issue contiguous commands
which keep the pixel streamer busy:

WAITVID --> pixel streamer --> colorspace/modulator --> DACx signals --> I/O pins

VID double-buffers WAITVID commands to relax WAITVID timing requirements.

If you don't want to commit to a WAITVID, which will stall the instruction pipeline
until VID is ready for another command, you can use the POLVID instruction to test
whether or not VID is ready for another WAITVID. If VID is ready, a subsequent WAITVID
will take only one clock:

POLVID WC - Check if VID ready for another WAITVID, C=1 if ready

Here is the WAITVID instruction:

WAITVID D,S/#n - Wait for VID ready, then give next command via D and S

When WAITVID executes, the D and S values are captured by VID and used for the duration
of the command.

The D operand in WAITVID has four fields:

%AAAAAAAA_MMMM_PPPPPPP_CCCCCCCCCCCCC

%AAAAAAAA = stack RAM base address for pixel lookup (0..255)
%MMMM = pixel mode (0..15), elaborated below
%PPPPPPP = number minus 1 of dot clocks per pixel (0..127 --> 1..128)
%CCCCCCCCCCCCC = number minus 1 of dot clocks in WAITVID (0..8191 --> 1..8192)
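
The four fields can be assembled as in this Python sketch, which applies the "number minus 1" encoding for the two clock counts. The function name and parameter names are hypothetical.

```python
def waitvid_d(base, mode, clocks_per_pixel, total_clocks):
    """Pack the WAITVID D operand as %AAAAAAAA_MMMM_PPPPPPP_CCCCCCCCCCCCC."""
    if not 0 <= base <= 255:
        raise ValueError("stack RAM base address is 0..255")
    if not 0 <= mode <= 15:
        raise ValueError("pixel mode is 0..15")
    if not 1 <= clocks_per_pixel <= 128:
        raise ValueError("dot clocks per pixel is 1..128")
    if not 1 <= total_clocks <= 8192:
        raise ValueError("dot clocks in WAITVID is 1..8192")
    return base << 24 | mode << 20 | (clocks_per_pixel - 1) << 13 | (total_clocks - 1)
```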

The D operand's %MMMM field determines which pixel mode will be used for the WAITVID and
what the S operand will be used for:

%0000 = LIT_SRGB26 - S is used as a literal 2:8:8:8 pixel. Only the %CCCCCCCCCCCCC
bits of D are used (all other bits can be 0).

%0001 = CLU1_SRGB26 - 32 1-bit offsets in S lookup 2:8:8:8 pixel longs in stack RAM
%0010 = CLU2_SRGB26 - 16 2-bit offsets in S lookup 2:8:8:8 pixel longs in stack RAM
%0011 = CLU4_SRGB26 - 8 4-bit offsets in S lookup 2:8:8:8 pixel longs in stack RAM
%0100 = CLU8_SRGB26 - 4 8-bit offsets in S lookup 2:8:8:8 pixel longs in stack RAM
%0101 = CLU8_RGB15 * - 4 8-bit offsets in S lookup 0:5:5:5 pixel words in stack RAM
%0110 = CLU8_RGB16 * - 4 8-bit offsets in S lookup 0:5:6:5 pixel words in stack RAM

The CLUx modes capture S, using its 1/2/4/8-bit fields, lowest
field first, as offsets for looking up pixels in stack RAM,
starting at %AAAAAAAA. Upon completion of each pixel, the next
higher bit field is used, with the highest field repeating.

For CLU1_SRGB26..CLU8_SRGB26, the 1/2/4/8-bit fields are used
as long offsets into stack RAM, yielding 2:8:8:8 pixel data.

For CLU8_RGB15 and CLU8_RGB16, bits 7..1 of each 8-bit field
are used as the long offset, while bit 0 selects the low/high
word containing the 0:5:5:5 or 0:5:6:5 pixel data.

%0111 = STR1_RGB9 * - 1-bit pixels streamed from stack RAM select between 0:3:3:3
colors in S[17..9] and S[26..18]. The stream start address in
stack RAM is %AAAAAAAA plus S[7..0], with S[31..27] selecting
the starting bit.

%1000 = STR4_RGBI4 * - 4-bit pixels are streamed from stack RAM starting at %AAAAAAAA
plus S[7:0], with S[31..29] selecting the starting nibble. The
pixels are colored as:

%0000 = black
%0001 = dark grey
%0010 = dark blue
%0011 = bright blue
%0100 = dark green
%0101 = bright green
%0110 = dark cyan
%0111 = bright cyan
%1000 = dark red
%1001 = bright red
%1010 = dark magenta
%1011 = bright magenta
%1100 = olive
%1101 = yellow
%1110 = light grey
%1111 = white

%1001 = STR4_LUMA4 * - 4-bit pixels are streamed from stack RAM starting at %AAAAAAAA
plus S[7:0], with S[31..29] selecting the starting nibble. The
pixels are used as brightness values for colors determined by
S[11..9]:

%000 = black..orange
%001 = black..blue
%010 = black..green
%011 = black..cyan
%100 = black..red
%101 = black..magenta
%110 = black..yellow
%111 = black..white

%1010 = STR8_RGBI8 * - 8-bit pixels are streamed from stack RAM starting at %AAAAAAAA
plus S[7:0], with S[31..30] selecting the starting byte. The
pixels are colored as:

$00..$1F = black..orange
$20..$3F = black..blue
$40..$5F = black..green
$60..$7F = black..cyan
$80..$9F = black..red
$A0..$BF = black..magenta
$C0..$DF = black..yellow
$E0..$FF = black..white

%1011 = STR8_LUMA8 * - 8-bit pixels are streamed from stack RAM starting at %AAAAAAAA
plus S[7:0], with S[31..30] selecting the starting byte. The
pixels are used as brightness values for colors determined by
S[11..9]:

%000 = black..orange
%001 = black..blue
%010 = black..green
%011 = black..cyan
%100 = black..red
%101 = black..magenta
%110 = black..yellow
%111 = black..white

%1100 = STR8_RGB8 * - 8-bit 0:3:3:2 pixels are streamed from stack RAM starting at
%AAAAAAAA plus S[7:0], with S[31..30] selecting the starting byte.

%1101 = STR16_RGB15 * - 15-bit 0:5:5:5 pixels are streamed from stack RAM starting at
%AAAAAAAA plus S[7:0], with S[31] selecting the starting word.

%1110 = STR16_RGB16 * - 16-bit 0:5:6:5 pixels are streamed from stack RAM starting at
%AAAAAAAA plus S[7:0], with S[31] selecting the starting word.

%1111 = STR32_SRGB26 - 26-bit 2:8:8:8 pixels are streamed from stack RAM starting at
%AAAAAAAA plus S[7:0].

* SYNC bits are set to %00 for these modes, since they specify color data, only.

The following example programs display luma-graduated color bars in various output modes:

simple_VGA_1280x1024.spin
simple_VGA_800x600.spin
simple_VGA_640x480.spin
simple_HDTV_1920x1080p.spin
simple_HDTV_1280x720p.spin
simple_NTSC_256x192.spin

TEXTURE MAPPER
--------------

Each cog has a texture mapper (PIX) which can sequentially navigate a rectangular 2D texture
map with Z-perspective correction to locate a texture pixel, translate that texture pixel into
A:R:G:B (Alpha:Red:Green:Blue) pixel data, perform discrete scaling on those A:R:G:B components,
and then alpha-blend the resulting pixel with another pixel for multi-layered 3D effects.

A texture map is stored in register RAM as a sequence of 1/2/4/8-bit texture pixels which build
from the bottom bits of an initial register, upward, then into subsequent registers. They are
ordered, in contiguous sequence, from top-left to top-right down to bottom-left to bottom-right.
These texture pixels get used as offsets into stack RAM to look up A:R:G:B pixel data. Texture
map width and height are individually settable to 1/2/4/8/16/32/64/128 pixel(s).

The SETPIX instruction is used to configure PIX:

SETPIX D/#n - Set PIX configuration to %UUU_VVV_PP_W_H_V_xxxx_AAAAAAAA_RRRRRRRRR

%UUU = texture map width, %VVV = texture map height

%000 = 1 pixel
%001 = 2 pixels
%010 = 4 pixels
%011 = 8 pixels
%100 = 16 pixels
%101 = 32 pixels
%110 = 64 pixels
%111 = 128 pixels

%PP = texture pixel size

%00 = 1 bit
%01 = 2 bits
%10 = 4 bits
%11 = 8 bits

%W = stack RAM pixel data offset/size

%0 = long offset, 8:8:8:8 bit A:R:G:B data
%1 = word offset, 1:5:5:5 bit A:R:G:B data (gets expanded to 8:8:8:8)

%H = horizontal mirroring

%0 = OFF, image repeats when U'[15] set
%1 = ON, image mirrors when U'[15] set

%V = vertical mirroring

%0 = OFF, image repeats when V'[15] set
%1 = ON, image mirrors when V'[15] set

%AAAAAAAA = base address in stack RAM of A:R:G:B pixel data

%RRRRRRRRR = base address in register RAM of texture pixels

Aside from SETPIX, which configures PIX's base metrics, there are seven other instructions
which establish initial values and deltas for the (U,V) texture coordinates, Z perspective,
and A/R/G/B scalers. These instructions are likely to be used before every sequence of GETPIX
instructions. They each set the value of their respective 16-bit parameter to the low word of
their operand, while the high word sets the 16-bit delta which gets added to the parameter
upon every GETPIX instruction:

SETPIXU D/#n - Set U to low word and DU to high word
SETPIXV D/#n - Set V to low word and DV to high word
SETPIXZ D/#n - Set Z to low word and DZ to high word
SETPIXA D/#n - Set A to low word and DA to high word
SETPIXR D/#n - Set R to low word and DR to high word
SETPIXG D/#n - Set G to low word and DG to high word
SETPIXB D/#n - Set B to low word and DB to high word
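
A minimal Python sketch of the operand layout shared by these seven instructions follows. It assumes that each parameter wraps at 16 bits when its delta is added, which is implied by the 16-bit parameter size but not stated outright. The function names are hypothetical.

```python
def setpix_operand(value, delta):
    """Pack a 16-bit initial value (low word) and a 16-bit delta (high word)."""
    return (delta & 0xFFFF) << 16 | (value & 0xFFFF)

def value_after(operand, getpix_count):
    """Parameter value after the delta has been added once per GETPIX (assumed 16-bit wrap)."""
    value, delta = operand & 0xFFFF, operand >> 16
    return (value + getpix_count * delta) & 0xFFFF
```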

Once PIX is configured and initial parameters are set, the GETPIX instruction may be used to
look up the current texture pixel, scale its A/R/G/B components, blend it with a pixel in D,
and update the U/V/Z/A/R/G/B parameters with their deltas. GETPIX takes 3 clocks and also
needs 3 clocks in pipeline stages 2 and 3:

NOP #2 'ready pipeline, GETPIX needs 3 clocks in pipeline stage 2
NOP #2 'ready pipeline, GETPIX needs 3 clocks in pipeline stage 3
GETPIX pixel 'execute GETPIX, GETPIX takes 3 clocks in pipeline stage 4

To make GETPIX more efficient, it can be repeated using REPD to perform a sequence of pixel
operations:

REPD #64,#1 'render 64 texture pixels and blend them with 'pixels'
SETINDA #pixels 'point INDA to pixels
NOP #2 'ready pipeline, 3 clocks in initial pipeline stage 2
NOP #2 'ready pipeline, 3 clocks in initial pipeline stage 3
GETPIX INDA++ 'execute GETPIX, 3 clocks per repeating GETPIX

As GETPIX executes, the following sequence occurs over three pipeline stages:

In pipeline stage 2:

Z-perspective correction
------------------------
Z' = 256 - Z[15:8]
U' = (U[15:0] / Z') MOD 256
V' = (V[15:0] / Z') MOD 256

A texture pixel is read from register RAM at texture map location (U',V'), with
the U' and V' top-most bits being used as coordinates. For example, if the texture
size is 32x8, then the top 5 bits of U' and the top 3 bits of V' would be used to
locate the texture pixel.

parameter updating
------------------
Z = Z + DZ
U = U + DU
V = V + DV

In pipeline stage 3:

The texture pixel is used as an offset to look up A:R:G:B pixel data in stack RAM,
which gets assigned to TA:TR:TG:TB.

In pipeline stage 4:

pixel scaling
-------------
A' = (TA * A[15:8] + 255) / 256
R' = (TR * R[15:8] + 255) / 256
G' = (TG * G[15:8] + 255) / 256
B' = (TB * B[15:8] + 255) / 256

pixel blending
--------------
D[31..24] = 0
D[23..16] = (A' * R' + (255 - A') * D[23..16] + 255) / 256
D[15..8] = (A' * G' + (255 - A') * D[15..8] + 255) / 256
D[7..0] = (A' * B' + (255 - A') * D[7..0] + 255) / 256

C = A' <> 0 (for GETPIX D/#n WC, C = texture pixel opacity <> 0)

parameter updating
------------------
A = A + DA
R = R + DR
G = G + DG
B = B + DB
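
The stage 4 scaling and blending formulas can be exercised with the Python sketch below. It follows the formulas above with integer (floor) division and does not model the parameter updating. The function names are hypothetical.

```python
def scale(t, param16):
    """Scale an 8-bit texture component by the top byte of a 16-bit parameter."""
    return (t * ((param16 >> 8) & 0xFF) + 255) // 256

def getpix_blend(texel, d, a16, r16, g16, b16):
    """Return (new D, C flag) for one A:R:G:B texel blended over the pixel in D."""
    ta, tr, tg, tb = ((texel >> s) & 0xFF for s in (24, 16, 8, 0))
    a, r, g, b = scale(ta, a16), scale(tr, r16), scale(tg, g16), scale(tb, b16)

    def mix(fg, bg):
        # Alpha-blend one 8-bit channel, as in the D[...] formulas above.
        return (a * fg + (255 - a) * bg + 255) // 256

    result = (mix(r, (d >> 16) & 0xFF) << 16
              | mix(g, (d >> 8) & 0xFF) << 8
              | mix(b, d & 0xFF))
    return result, a != 0
```

With all scalers at $FF00, a fully opaque texel replaces the pixel in D, and a fully transparent texel leaves the lower 24 bits of D unchanged with C clear.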

Note that if Z[15:8] = 0, no scaling occurs; that is, (U',V') = (U[15:8],V[15:8]). The bigger
Z[15:8] gets, the more compressed the texture rendering becomes, until when Z[15:8] = 255,
(U',V') = (U[7:0],V[7:0]).
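
The two limiting cases can be confirmed with a small Python model of the stage 2 arithmetic. Integer (floor) division is assumed, and the function names are hypothetical. The second function picks the top-most bits of U' and V' for a texture whose width and height are given by %UUU and %VVV.

```python
def perspective(u, v, z):
    """Apply Z-perspective correction to 16-bit U and V, returning 8-bit (U', V')."""
    z_prime = 256 - ((z >> 8) & 0xFF)
    return (u // z_prime) % 256, (v // z_prime) % 256

def texel_coords(u_prime, v_prime, uuu, vvv):
    """Use the top uuu bits of U' and the top vvv bits of V' as texture coordinates."""
    return u_prime >> (8 - uuu), v_prime >> (8 - vvv)
```

For the 32x8 example above, %UUU = %101 and %VVV = %011, so five bits of U' and three bits of V' are used.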

The following example program provides a simple illustration of how PIX is used:

texture_NTSC_256x192.spin

PIN TRANSFER
------------

Each cog has a pin transfer circuit (XFR) which can automatically move data between pins
and QUADs, or from pins to stack RAM, in the background, while instructions execute.

XFR is configured with the SETXFR instruction:

SETXFR D/#n - Set XFR configuration to %MMM_PPP

%MMM = mode

%00x = off (initial state after cog start)
%010 = QUADs_to_16_pins
%011 = QUADs_to_32_pins
%100 = 16_pins_to_QUADs
%101 = 32_pins_to_QUADs
%110 = 16_pins_to_stack (@SPA++)
%111 = 32_pins_to_stack (@SPA++)

%PPP = pin group

%000 = pins 15..0 for 16-pin modes, pins 31..0 for 32-pin modes
%001 = pins 31..16 for 16-pin modes, pins 31..0 for 32-pin modes
%010 = pins 47..32 for 16-pin modes, pins 63..32 for 32-pin modes
%011 = pins 63..48 for 16-pin modes, pins 63..32 for 32-pin modes
%100 = pins 79..64 for 16-pin modes, pins 95..64 for 32-pin modes
%101 = pins 95..80 for 16-pin modes, pins 95..64 for 32-pin modes
%11x = no pins
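
For illustration, the Python sketch below packs the %MMM_PPP operand and reports which pins a given mode and pin group cover, following the two tables above. The function names are hypothetical.

```python
def setxfr_operand(mode, group):
    """Pack the SETXFR configuration as %MMM_PPP."""
    if not 0 <= mode <= 7 or not 0 <= group <= 7:
        raise ValueError("mode and group are 3-bit values")
    return mode << 3 | group

def pins_for(mode, group):
    """Return (highest pin, lowest pin) used by XFR, or None if off or no pins."""
    if mode < 0b010 or group >= 0b110:
        return None
    if mode & 1:                       # odd modes are the 32-pin modes
        base = (group >> 1) * 32
        return base + 31, base
    base = group * 16                  # even modes are the 16-pin modes
    return base + 15, base
```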

For QUADs_to_16_pins mode (%010), on the cycle after SETXFR is executed, the following
8-clock pattern begins and then repeats indefinitely:

1st clock: QUAD0[15..0] is output to pins
2nd clock: QUAD0[31..16] is output to pins
3rd clock: QUAD1[15..0] is output to pins
4th clock: QUAD1[31..16] is output to pins
5th clock: QUAD2[15..0] is output to pins
6th clock: QUAD2[31..16] is output to pins
7th clock: QUAD3[15..0] is output to pins
8th clock: QUAD3[31..16] is output to pins

This mode is useful for coordinating with a 'RDQUAD PTRx++' instruction so that a
continuous stream of words from hub memory can be output to an SDRAM's DQ pins. This
enables SDRAM writing at the cog's hub bandwidth limit.

For QUADs_to_32_pins mode (%011), on the cycle after SETXFR is executed, the following
4-clock pattern begins and then repeats indefinitely:

1st clock: QUAD0 is output to pins
2nd clock: QUAD1 is output to pins
3rd clock: QUAD2 is output to pins
4th clock: QUAD3 is output to pins

For 16_pins_to_QUADs mode (%100), on the cycle after SETXFR is executed, the following
8-clock pattern begins and then repeats indefinitely:

1st clock: pins are sampled as low word
2nd clock: pins are sampled as high word, long is written to QUAD0
3rd clock: pins are sampled as low word
4th clock: pins are sampled as high word, long is written to QUAD1
5th clock: pins are sampled as low word
6th clock: pins are sampled as high word, long is written to QUAD2
7th clock: pins are sampled as low word
8th clock: pins are sampled as high word, long is written to QUAD3

This mode is useful for coordinating with a 'WRQUAD PTRx++' instruction so that a
continuous stream of words input from an SDRAM's DQ pins can be written to hub memory.
This enables SDRAM reading at the cog's hub bandwidth limit.
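
The word ordering of the two 16-pin QUAD modes can be modeled in Python as shown below. This sketch covers only the low-word-first ordering within each 8-clock pattern; timing and hub synchronization are not modeled, and the function names are hypothetical.

```python
def quads_to_words(quads):
    """Yield the eight 16-bit words output by QUADs_to_16_pins, in clock order."""
    for quad in quads:                 # QUAD0..QUAD3
        yield quad & 0xFFFF            # low word on the odd-numbered clock
        yield (quad >> 16) & 0xFFFF    # high word on the even-numbered clock

def words_to_quads(words):
    """Rebuild four longs from eight sampled words, as 16_pins_to_QUADs does."""
    words = list(words)
    return [lo | hi << 16 for lo, hi in zip(words[0::2], words[1::2])]
```

Words written out by one mode and sampled back by the other reassemble into the original four longs.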

For 32_pins_to_QUADs mode (%101), on the cycle after SETXFR is executed, the following
4-clock pattern begins and then repeats indefinitely:

1st clock: pins are sampled and written to QUAD0
2nd clock: pins are sampled and written to QUAD1
3rd clock: pins are sampled and written to QUAD2
4th clock: pins are sampled and written to QUAD3

For 16_pins_to_stack mode (%110), on the cycle after SETXFR is executed, the following
2-clock pattern begins and then repeats indefinitely:

1st clock: pins are sampled as low word
2nd clock: pins are sampled as high word, long is written to stack at SPA++

For 32_pins_to_stack mode (%111), on the cycle after SETXFR is executed, the following
1-clock pattern begins and then repeats indefinitely:

1st clock: pins are sampled and written to stack at SPA++

While a pins_to_stack mode is active, you should not read or write stack RAM or modify
SPA, as such attempts will likely cause unexpected results.

To stop XFR, execute 'SETXFR #0'.