Overview

These are the intrinsic functions used for implementing a peak cancellation based crest factor reduction (PC-CFR) application. The functionality for this application is split between AIE and programmable logic (PL), where the PL carries out the peak detections and AIE computes the aggregate cancellation signal for the detected peaks. The cancellation signal samples computed by the AIE are subtracted in the PL from the delayed original signal, to cancel the peaks.

The AIE computes the cancellation signal samples by scaling the cancellation pulse (CP) coefficients (which are stored in the AIE memory) for different peaks and summing them up. The two input stream interfaces of the AI Engine are used to receive the following information from the PL: 1) Metadata for LUT indices to read CP coefficients + configuration information for the vectorized mul/mac operations, 2) Complex scaling factors for the detected peaks. The output stream interface of the AI Engine is employed to send the computed cancellation signal samples to the PL.

Typically the AIE program computing the aggregate cancellation signal for N detected peaks comprises the following steps:

1. Read the input streams:
   - Split the metadata from input stream port 0 into the CP LUT index (idx) and the configuration information (ci) to be used by the mul or mac intrinsics
   - Get the scaling factors from input stream port 1 and write them into a scaling factor buffer
2. Load CP_LUT(idx) and CP_LUT(idx+1) from memory into the CP buffer
3. Pass ci as a parameter to configure the mul intrinsic, which multiplies the CP coefficients selected from the CP buffer with the scaling factor for the first peak from the scaling factor buffer
4. Repeat the above steps N-1 times, using the mac intrinsic (instead of mul), to find the accumulated result for N peaks
5. Move the accumulated result through the SRS unit to the output stream.
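
The steps above can be sketched as a scalar, host-side reference model. This is not AIE kernel code and does not call the intrinsics. The names Cplx, Peak, Row and aggregate_cancellation are hypothetical. The model assumes that one CP LUT entry holds 8 coefficients, that the 5 configuration bits map directly onto rev_xstart, and that selections leaving CP(0)..CP(15) are invalid. The SRS shift and rounding of the last step are omitted.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one complex lane (cint16 input or cacc48 accumulator).
struct Cplx {
    std::int64_t re;
    std::int64_t im;
};

// One detected peak: metadata word from stream port 0, scaling factor from port 1.
struct Peak {
    std::int32_t meta;
    Cplx scale;
};

using Row = std::array<Cplx, 8>;  // assumption: one CP LUT entry holds 8 coefficients

// Scalar model of the aggregate cancellation signal (8 output lanes) for all peaks.
std::array<Cplx, 8> aggregate_cancellation(const std::vector<Row>& cp_lut,
                                           const std::vector<Peak>& peaks) {
    std::array<Cplx, 8> acc{};  // starting from zero makes the first peak act like mul
    for (const Peak& p : peaks) {
        // Step 1: split the metadata into LUT index (27 MSB) and configuration (5 LSB).
        const std::int32_t idx = p.meta >> 5;  // arithmetic shift assumed
        const unsigned ci = static_cast<std::uint32_t>(p.meta) & 0x1Fu;
        assert(idx >= 0 && static_cast<std::size_t>(idx) + 1 < cp_lut.size());

        // Step 2: CP_LUT(idx) and CP_LUT(idx+1) together form the 16-entry CP buffer.
        const Row& lo = cp_lut[static_cast<std::size_t>(idx)];
        const Row& hi = cp_lut[static_cast<std::size_t>(idx) + 1];

        // Steps 3 and 4: scale the 8 selected coefficients and accumulate them.
        const bool reverse = (ci & 0x10u) != 0;
        const int start = static_cast<int>(ci & 0xFu);
        for (int lane = 0; lane < 8; ++lane) {
            const int k = reverse ? start - lane : start + lane;
            assert(k >= 0 && k < 16);  // out-of-range selections are not modeled
            Cplx c = (k < 8) ? lo[static_cast<std::size_t>(k)]
                             : hi[static_cast<std::size_t>(k - 8)];
            if (reverse) c.im = -c.im;  // backwards selection uses the conjugate
            Cplx& a = acc[static_cast<std::size_t>(lane)];
            a.re += p.scale.re * c.re - p.scale.im * c.im;
            a.im += p.scale.re * c.im + p.scale.im * c.re;
        }
    }
    return acc;  // step 5 (SRS and output stream) is not modeled
}
```

A model like this is mainly useful as a golden reference when checking kernel output in simulation.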

Functions
void	split (int a, unsigned n, int &d0, unsigned &d1)
	Intrinsic used to split the 32 bit input data into two resulting variables at the n-th bit.

CFR Multiplication Intrinsics
v8cacc48	mul8_cfr (v16cint16 xbufa, v16cint16 xbufb, int rev_xstart, int xrot, v8cint16 zbuf, unsigned int zstart)
	Complex multiply intrinsic function for cancellation signal calculations in peak-cancellation crest factor reduction algorithm.

v8cacc48	mac8_cfr (v8cacc48 acc, v16cint16 xbufa, v16cint16 xbufb, int rev_xstart, int xrot, v8cint16 zbuf, unsigned int zstart)
	Complex multiply-accumulate intrinsic function for cancellation signal calculations in peak-cancellation crest factor reduction algorithm.

Function Documentation

v8cacc48 mac8_cfr	(	v8cacc48	acc,
		v16cint16	xbufa,
		v16cint16	xbufb,
		int	rev_xstart,
		int	xrot,
		v8cint16	zbuf,
		unsigned int	zstart
	)

Complex multiply-accumulate intrinsic function for cancellation signal calculations in peak-cancellation crest factor reduction algorithm.

PARAMETERS

Input/Output	Type	Valid bits	Comments
acc	v8cacc48	All	Running accumulation vector (8 x cint48 lanes). Only in mac variant.
xbufa	v16cint16	All	First input buffer of 16 complex samples of type cint16
xbufb	v16cint16	All	Second input buffer of 16 complex samples of type cint16
rev_xstart	int	5b LSB	MSB : Flag for backwards input selection / 4b LSB : select starting point within input data.
xrot	int	2b LSB	Selects which 256b lanes of 8 complex samples from xbufa and xbufb to use. This must be a compile time constant.
zbuf	v8cint16	All	Buffer of scaling factors for each qualified peak
zstart	unsigned int	3b LSB	Selects which of the 8 scaling factor values is used. This must be a compile time constant.
return value	v8cacc48	All	Resulting accumulation vector (8 x cint48 lanes)

Note: Parameters 'xrot' and 'zstart' must be compile time constants.

The input data provided by xbufa and xbufb can be seen as a concatenation of 8 cancellation pulse (CP) coefficients of type cint16 from xbufa followed by the next 8 coefficients from xbufb as selected by xrot. The resulting 16 samples will be referred to as "CP" in this document. The CP coefficients are loaded to xbufa and xbufb from memory. zbuf contains the 8 scaling factor values, one for each qualified peak. The zstart parameter is used to select the scaling factor for each mul operation.

DATA SELECTORS

xrot:

Selects the first or second set of 8 values to be used from each of the buffers xbufa and xbufb :

xrot value	selection in xbufa	selection in xbufb
0x0	coefficients 0 to 7	coefficients 0 to 7
0x1	coefficients 8 to 15	coefficients 0 to 7
0x2	coefficients 0 to 7	coefficients 8 to 15
0x3	coefficients 8 to 15	coefficients 8 to 15

Examples :

If you have previously updated xbufa with upd_w(0) (values 0 to 7 have been replaced) and xbufb with upd_w(1) (values 8 to 15 have been replaced), you would choose xrot=0x2 :

CP(0)	CP(1)	CP(2)	CP(3)	CP(4)	CP(5)	CP(6)	CP(7)	CP(8)	CP(9)	CP(10)	CP(11)	CP(12)	CP(13)	CP(14)	CP(15)
xbufa(0)	xbufa(1)	xbufa(2)	xbufa(3)	xbufa(4)	xbufa(5)	xbufa(6)	xbufa(7)	xbufb(8)	xbufb(9)	xbufb(10)	xbufb(11)	xbufb(12)	xbufb(13)	xbufb(14)	xbufb(15)

If you have xrot=0x1 :

CP(0)	CP(1)	CP(2)	CP(3)	CP(4)	CP(5)	CP(6)	CP(7)	CP(8)	CP(9)	CP(10)	CP(11)	CP(12)	CP(13)	CP(14)	CP(15)
xbufa(8)	xbufa(9)	xbufa(10)	xbufa(11)	xbufa(12)	xbufa(13)	xbufa(14)	xbufa(15)	xbufb(0)	xbufb(1)	xbufb(2)	xbufb(3)	xbufb(4)	xbufb(5)	xbufb(6)	xbufb(7)

It is standard practice to use only upd_w(0) and leave xrot at 0x0 unless your application can benefit from this option.
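
The xrot table can be restated as a small host-side model. This is a sketch only: Buf16 and select_cp are hypothetical names, and plain int values stand in for cint16 samples. Bit 0 of xrot picks the half of xbufa and bit 1 picks the half of xbufb, as in the table above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Buf16 = std::array<int, 16>;  // ints stand in for cint16 samples

// Build the 16-entry CP view from xbufa and xbufb for a given xrot (2 valid bits).
Buf16 select_cp(const Buf16& xbufa, const Buf16& xbufb, unsigned xrot) {
    const std::size_t a_off = (xrot & 0x1u) ? 8 : 0;  // bit 0: lower or upper half of xbufa
    const std::size_t b_off = (xrot & 0x2u) ? 8 : 0;  // bit 1: lower or upper half of xbufb
    Buf16 cp{};
    for (std::size_t i = 0; i < 8; ++i) {
        cp[i] = xbufa[a_off + i];      // CP(0)..CP(7)
        cp[8 + i] = xbufb[b_off + i];  // CP(8)..CP(15)
    }
    return cp;
}
```

With xrot=0x2 the model returns xbufa(0..7) followed by xbufb(8..15), and with xrot=0x1 it returns xbufa(8..15) followed by xbufb(0..7), matching the two layouts shown above.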

zstart:

Selects which of the 8 scaling factor values in zbuf will be used for the multiply operation. Valid values range from 0x0 to 0x7.

rev_xstart:

The 4 LSB select the starting point within the 16 CP values, since only 8 input CP values will be used for a single mul or mac operation.

Once the starting point is selected, the MSB of the 5 valid bits of rev_xstart (bit 4) determines the direction in which the CP values are read. The use of this flag improves the memory efficiency for conjugate-symmetric CPs, since only half of the CP coefficients need to be present in the memory.

Example :

If the 4 bits are set to 0x7, CP(7) will be selected as the starting point. Then the MSB of rev_xstart will influence the way the operation works :

Set to 0 : CP(7) up to CP(14) will be used

CP(0)	CP(1)	CP(2)	CP(3)	CP(4)	CP(5)	CP(6)	CP(7)	CP(8)	CP(9)	CP(10)	CP(11)	CP(12)	CP(13)	CP(14)	CP(15)
							start	---->	---->	---->	---->	---->	---->	end

Set to 1 : The complex conjugates of CP(7) down to CP(0) will be used

CP(0)	CP(1)	CP(2)	CP(3)	CP(4)	CP(5)	CP(6)	CP(7)	CP(8)	CP(9)	CP(10)	CP(11)	CP(12)	CP(13)	CP(14)	CP(15)
end	<----	<----	<----	<----	<----	<----	start
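
The rev_xstart decoding described above can be written as a short host-side sketch. LaneSelection and decode_rev_xstart are hypothetical names. The source does not state what happens when the run of 8 values would leave CP(0)..CP(15), so this model treats that case as a precondition violation.

```cpp
#include <array>
#include <cassert>

struct LaneSelection {
    std::array<int, 8> cp_index;  // CP index read by each of the 8 lanes
    bool conjugate;               // true when the complex conjugate is used
};

// Decode the 5 valid bits of rev_xstart: bit 4 is the direction flag,
// bits 3..0 are the starting point within the 16 CP values.
LaneSelection decode_rev_xstart(int rev_xstart) {
    const bool reverse = (rev_xstart & 0x10) != 0;
    const int start = rev_xstart & 0xF;
    LaneSelection sel{{}, reverse};
    for (int lane = 0; lane < 8; ++lane) {
        const int k = reverse ? start - lane : start + lane;
        assert(k >= 0 && k < 16);  // assumption: out-of-range runs are invalid
        sel.cp_index[lane] = k;
    }
    return sel;
}
```

For a starting point of 7, decode_rev_xstart(0x07) yields indices 7 up to 14 without conjugation, and decode_rev_xstart(0x17) yields indices 7 down to 0 with conjugation.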

FULL EXAMPLES

For both examples, the CP values will have been loaded into the lower half of xbufa and xbufb before they are passed to the function, and xrot can be left at 0x0.

The examples use mul8_cfr. mac8_cfr performs the same selection, but adds each product to the corresponding lane of acc instead of assigning it.

Example 1 (MSB of rev_xstart=0)

Command : mul8_cfr(xbufa, xbufb, 0x04, 0x0, zbuf, 0x2)

zstart is 0x2, so the third scaling factor value of zbuf will be used.
The 4 LSB of rev_xstart are set to 0x4, so the starting point within the 16 available CP values will be 4.

Resulting operation :

acc(0) = zbuf(2) * CP(4)
acc(1) = zbuf(2) * CP(5)
acc(2) = zbuf(2) * CP(6)
acc(3) = zbuf(2) * CP(7)
acc(4) = zbuf(2) * CP(8)
acc(5) = zbuf(2) * CP(9)
acc(6) = zbuf(2) * CP(10)
acc(7) = zbuf(2) * CP(11)

Example 2 (MSB of rev_xstart=1)

Command : mul8_cfr(xbufa, xbufb, 0x19, 0x0, zbuf, 0x3)

zstart is 0x3, so the fourth scaling factor value of zbuf will be used.
The 4 LSB of rev_xstart are set to 0x9, so the starting point within the 16 available CP values will be 9.
The MSB of rev_xstart (bit 4) is set to 1, so the CP values will go from CP(9) down to CP(2) and their complex conjugates will be used.

Resulting operation :

acc(0) = zbuf(3) * conj(CP(9))
acc(1) = zbuf(3) * conj(CP(8))
acc(2) = zbuf(3) * conj(CP(7))
acc(3) = zbuf(3) * conj(CP(6))
acc(4) = zbuf(3) * conj(CP(5))
acc(5) = zbuf(3) * conj(CP(4))
acc(6) = zbuf(3) * conj(CP(3))
acc(7) = zbuf(3) * conj(CP(2))
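
To cross-check lane equations like the ones above on a host machine, the parameter semantics can be gathered into one scalar model. This is a sketch under assumptions, not the intrinsic itself: Cplx, Vec8, Vec16, model_mul8_cfr and model_mac8_cfr are hypothetical names, 64-bit integers stand in for the cint16 and cacc48 lane types, and selections outside CP(0)..CP(15) are treated as invalid.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Cplx {
    std::int64_t re;
    std::int64_t im;
};
using Vec8 = std::array<Cplx, 8>;
using Vec16 = std::array<Cplx, 16>;

// Scalar model of mac8_cfr: acc(lane) += zbuf(zstart) * CP(selected index).
Vec8 model_mac8_cfr(Vec8 acc, const Vec16& xbufa, const Vec16& xbufb,
                    int rev_xstart, int xrot, const Vec8& zbuf, unsigned zstart) {
    const std::size_t a_off = (xrot & 0x1) ? 8 : 0;  // half of xbufa used for CP(0..7)
    const std::size_t b_off = (xrot & 0x2) ? 8 : 0;  // half of xbufb used for CP(8..15)
    const bool reverse = (rev_xstart & 0x10) != 0;   // bit 4: backwards and conjugate
    const int start = rev_xstart & 0xF;              // bits 3..0: starting point
    const Cplx z = zbuf[zstart & 0x7u];              // 3 valid bits
    for (int lane = 0; lane < 8; ++lane) {
        const int k = reverse ? start - lane : start + lane;
        assert(k >= 0 && k < 16);  // assumption: out-of-range runs are invalid
        Cplx c = (k < 8) ? xbufa[a_off + static_cast<std::size_t>(k)]
                         : xbufb[b_off + static_cast<std::size_t>(k - 8)];
        if (reverse) c.im = -c.im;
        Cplx& a = acc[static_cast<std::size_t>(lane)];
        a.re += z.re * c.re - z.im * c.im;
        a.im += z.re * c.im + z.im * c.re;
    }
    return acc;
}

// Scalar model of mul8_cfr: same as the mac model with a zero accumulator.
Vec8 model_mul8_cfr(const Vec16& xbufa, const Vec16& xbufb, int rev_xstart,
                    int xrot, const Vec8& zbuf, unsigned zstart) {
    return model_mac8_cfr(Vec8{}, xbufa, xbufb, rev_xstart, xrot, zbuf, zstart);
}
```

Calling the mul model with rev_xstart 0x04 and zstart 0x2 gives zbuf(2) * CP(4..11). Setting bit 4 as well (0x10 | 0x9) with zstart 0x3 gives zbuf(3) * conj(CP(9..2)). Passing the first result as acc to the mac model sums the two contributions lane by lane.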

v8cacc48 mul8_cfr	(	v16cint16	xbufa,
		v16cint16	xbufb,
		int	rev_xstart,
		int	xrot,
		v8cint16	zbuf,
		unsigned int	zstart
	)

Complex multiply intrinsic function for cancellation signal calculations in peak-cancellation crest factor reduction algorithm.

PARAMETERS

Input/Output	Type	Valid bits	Comments
acc	v8cacc48	All	Running accumulation vector (8 x cint48 lanes). Only in mac variant.
xbufa	v16cint16	All	First input buffer of 16 complex samples of type cint16
xbufb	v16cint16	All	Second input buffer of 16 complex samples of type cint16
rev_xstart	int	5b LSB	MSB : Flag for backwards input selection / 4b LSB : select starting point within input data.
xrot	int	2b LSB	Selects which 256b lanes of 8 complex samples from xbufa and xbufb to use. This must be a compile time constant.
zbuf	v8cint16	All	Buffer of scaling factors for each qualified peak
zstart	unsigned int	3b LSB	Selects which of the 8 scaling factor values is used. This must be a compile time constant.
return value	v8cacc48	All	Resulting accumulation vector (8 x cint48 lanes)

Note: Parameters 'xrot' and 'zstart' must be compile time constants.

The input data provided by xbufa and xbufb can be seen as a concatenation of 8 cancellation pulse (CP) coefficients of type cint16 from xbufa followed by the next 8 coefficients from xbufb as selected by xrot. The resulting 16 samples will be referred to as "CP" in this document. The CP coefficients are loaded to xbufa and xbufb from memory. zbuf contains the 8 scaling factor values, one for each qualified peak. The zstart parameter is used to select the scaling factor for each mul operation.

DATA SELECTORS

xrot:

Selects the first or second set of 8 values to be used from each of the buffers xbufa and xbufb :

xrot value	selection in xbufa	selection in xbufb
0x0	coefficients 0 to 7	coefficients 0 to 7
0x1	coefficients 8 to 15	coefficients 0 to 7
0x2	coefficients 0 to 7	coefficients 8 to 15
0x3	coefficients 8 to 15	coefficients 8 to 15

Examples :

If you have previously updated xbufa with upd_w(0) (values 0 to 7 have been replaced) and xbufb with upd_w(1) (values 8 to 15 have been replaced), you would choose xrot=0x2 :

CP(0)	CP(1)	CP(2)	CP(3)	CP(4)	CP(5)	CP(6)	CP(7)	CP(8)	CP(9)	CP(10)	CP(11)	CP(12)	CP(13)	CP(14)	CP(15)
xbufa(0)	xbufa(1)	xbufa(2)	xbufa(3)	xbufa(4)	xbufa(5)	xbufa(6)	xbufa(7)	xbufb(8)	xbufb(9)	xbufb(10)	xbufb(11)	xbufb(12)	xbufb(13)	xbufb(14)	xbufb(15)

If you have xrot=0x1 :

CP(0)	CP(1)	CP(2)	CP(3)	CP(4)	CP(5)	CP(6)	CP(7)	CP(8)	CP(9)	CP(10)	CP(11)	CP(12)	CP(13)	CP(14)	CP(15)
xbufa(8)	xbufa(9)	xbufa(10)	xbufa(11)	xbufa(12)	xbufa(13)	xbufa(14)	xbufa(15)	xbufb(0)	xbufb(1)	xbufb(2)	xbufb(3)	xbufb(4)	xbufb(5)	xbufb(6)	xbufb(7)

It is standard practice to use only upd_w(0) and leave xrot at 0x0 unless your application can benefit from this option.

zstart:

Selects which of the 8 scaling factor values in zbuf will be used for the multiply operation. Valid values range from 0x0 to 0x7.

rev_xstart:

The 4 LSB select the starting point within the 16 CP values, since only 8 input CP values will be used for a single mul or mac operation.

Once the starting point is selected, the MSB of the 5 valid bits of rev_xstart (bit 4) determines the direction in which the CP values are read. The use of this flag improves the memory efficiency for conjugate-symmetric CPs, since only half of the CP coefficients need to be present in the memory.

Example :

If the 4 bits are set to 0x7, CP(7) will be selected as the starting point. Then the MSB of rev_xstart will influence the way the operation works :

Set to 0 : CP(7) up to CP(14) will be used

CP(0)	CP(1)	CP(2)	CP(3)	CP(4)	CP(5)	CP(6)	CP(7)	CP(8)	CP(9)	CP(10)	CP(11)	CP(12)	CP(13)	CP(14)	CP(15)
							start	---->	---->	---->	---->	---->	---->	end

Set to 1 : The complex conjugates of CP(7) down to CP(0) will be used

CP(0)	CP(1)	CP(2)	CP(3)	CP(4)	CP(5)	CP(6)	CP(7)	CP(8)	CP(9)	CP(10)	CP(11)	CP(12)	CP(13)	CP(14)	CP(15)
end	<----	<----	<----	<----	<----	<----	start

FULL EXAMPLES

For both examples, the CP values will have been loaded into the lower half of xbufa and xbufb before they are passed to the function, and xrot can be left at 0x0.

The mac variant of this intrinsic (mac8_cfr) performs the same selection, but adds each product to the corresponding lane of acc instead of assigning it.

Example 1 (MSB of rev_xstart=0)

Command : mul8_cfr(xbufa, xbufb, 0x04, 0x0, zbuf, 0x2)

zstart is 0x2, so the third scaling factor value of zbuf will be used.
The 4 LSB of rev_xstart are set to 0x4, so the starting point within the 16 available CP values will be 4.

Resulting operation :

acc(0) = zbuf(2) * CP(4)
acc(1) = zbuf(2) * CP(5)
acc(2) = zbuf(2) * CP(6)
acc(3) = zbuf(2) * CP(7)
acc(4) = zbuf(2) * CP(8)
acc(5) = zbuf(2) * CP(9)
acc(6) = zbuf(2) * CP(10)
acc(7) = zbuf(2) * CP(11)

Example 2 (MSB of rev_xstart=1)

Command : mul8_cfr(xbufa, xbufb, 0x19, 0x0, zbuf, 0x3)

zstart is 0x3, so the fourth scaling factor value of zbuf will be used.
The 4 LSB of rev_xstart are set to 0x9, so the starting point within the 16 available CP values will be 9.
The MSB of rev_xstart (bit 4) is set to 1, so the CP values will go from CP(9) down to CP(2) and their complex conjugates will be used.

Resulting operation :

acc(0) = zbuf(3) * conj(CP(9))
acc(1) = zbuf(3) * conj(CP(8))
acc(2) = zbuf(3) * conj(CP(7))
acc(3) = zbuf(3) * conj(CP(6))
acc(4) = zbuf(3) * conj(CP(5))
acc(5) = zbuf(3) * conj(CP(4))
acc(6) = zbuf(3) * conj(CP(3))
acc(7) = zbuf(3) * conj(CP(2))

void split	(	int	a,
		unsigned	n,
		int &	d0,
		unsigned &	d1
	)

Intrinsic used to split the 32 bit input data into two resulting variables at the n-th bit.

The split intrinsic separates the 32 bits of the input into two fields: index information used to update the CP LUT pointers, and the remaining low-order bits. In DPD, the intrinsic prepares the magnitude values for further processing. The parameters are the following:

Parameters

Parameter	Type	Comments
a	int	Input data as a 32-bit signed integer.
n	unsigned	Number of LSBs that shall end up in d1. This must be a compile-time constant.
d0	int&	Output variable that will contain bits n to 31 of the input. Intended as an index; it is a signed (sign-extended) number.
d1	unsigned&	Output variable that will contain bits 0 to n-1 of the input.

Example :

Command : split(data, 6, out0, out1)

Assume that data = 0x44FA, which gives the following operation :

data = 0100 0100 11|11 1010 (split after the n-th LSB, which is 6 in this example)

This gives :

out0 = 0000 0001 0001 0011
out1 = 0000 0000 0011 1010
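
The bit manipulation in this example can be reproduced with a host-side model. This is a sketch, not the intrinsic: split_model is a hypothetical name, and the sign extension of d0 relies on an arithmetic right shift of signed integers, which C++20 guarantees and earlier compilers provide in practice.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of split(a, n, d0, d1):
// d1 receives bits 0..n-1 (unsigned), d0 receives bits n..31 (sign-extended).
void split_model(std::int32_t a, unsigned n, std::int32_t& d0, std::uint32_t& d1) {
    assert(n < 32);
    const std::uint32_t mask = (std::uint32_t{1} << n) - 1u;
    d1 = static_cast<std::uint32_t>(a) & mask;
    d0 = a >> n;  // arithmetic shift keeps the sign of the index
}
```

For data = 0x44FA and n = 6 the model gives d0 = 0x113 and d1 = 0x3A, the same bit patterns as out0 and out1 above.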

Crest Factor Reduction Application

For Peak Cancellation CFR, one of the two input streams into an AI Engine is dedicated to communicate 32 bit metadata samples. The 27 MSB of a metadata sample are used for Cancellation Pulse (CP) LUT indexing, and the 5 LSB provide configuration information for the subsequent mul or mac operation.

See below for more information on how the 5 LSB are used for configuring the mul and mac operations:

v8cacc48 mul8_cfr(v16cint16 xbufa, v16cint16 xbufb, int rev_xstart, int xrot, v8cint16 zbuf, unsigned int zstart)

v8cacc48 mac8_cfr(v8cacc48 acc, v16cint16 xbufa, v16cint16 xbufb, int rev_xstart, int xrot, v8cint16 zbuf, unsigned int zstart)
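
As an illustration of the 27-bit/5-bit metadata layout described above, the following host-side sketch packs and unpacks one metadata word. pack_metadata and unpack_metadata are hypothetical helpers. The sketch assumes that the 5 configuration bits are passed unchanged as rev_xstart (bit 4 as the direction flag, bits 3..0 as the starting point); the source does not spell out this mapping.

```cpp
#include <cassert>
#include <cstdint>

// Pack a signed 27-bit CP LUT index and the 5 configuration bits into one word.
std::int32_t pack_metadata(std::int32_t lut_index, bool reverse, unsigned xstart) {
    assert(xstart < 16);
    assert(lut_index >= -(1 << 26) && lut_index < (1 << 26));  // fits in 27 bits
    const std::uint32_t ci = (reverse ? 0x10u : 0u) | (xstart & 0xFu);
    const std::uint32_t word = (static_cast<std::uint32_t>(lut_index) << 5) | ci;
    return static_cast<std::int32_t>(word);
}

// Recover the index (27 MSB, sign-extended) and the configuration bits (5 LSB).
void unpack_metadata(std::int32_t word, std::int32_t& lut_index, unsigned& ci) {
    lut_index = word >> 5;  // arithmetic shift assumed
    ci = static_cast<std::uint32_t>(word) & 0x1Fu;
}
```

The unpack step is the same separation that splitting the word at n = 5 is described to perform: the upper bits become the signed index and the lower 5 bits become the configuration value.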

Digital Pre-Distortion Application

The split intrinsic used in DPD applications is slightly different and has an additional parameter:

void split(int mag, int frac_bits, int lut_width, int& idx, unsigned& frac)
