Top-Down: Optical Flow Algorithm

The Lucas-Kanade (LK) method is a widely used differential method for optical flow estimation, that is, estimating the movement of pixels between two related images. In this example system, the related images are the current and previous frames of a video stream. The LK method is compute-intensive: it works over a window of neighboring pixels and uses a least-squares fit to find matching pixels.
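
As a rough illustration of the least-squares step, the following sketch solves the standard LK 2x2 normal equations for a single window. It is not taken from the example design: the names lk_flow_t and lk_solve_window are hypothetical, and the spatial gradients (ix, iy) and temporal difference (it) for each pixel in the window are assumed to be already computed.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical result type: flow vector (u, v) plus a validity flag. */
typedef struct {
    float u;
    float v;
    int valid;   /* 0 when the window has too little texture to solve */
} lk_flow_t;

/*
 * Least-squares solve for one window of n pixels.
 * Minimizes sum((ix*u + iy*v + it)^2), which leads to the 2x2 system
 *   [a b] [u]   [-dx]
 *   [b c] [v] = [-dy]
 * where a = sum(ix*ix), b = sum(ix*iy), c = sum(iy*iy),
 *       dx = sum(ix*it), dy = sum(iy*it).
 */
static lk_flow_t lk_solve_window(const float *ix, const float *iy,
                                 const float *it, int n)
{
    assert(ix != NULL && iy != NULL && it != NULL && n > 0);

    double a = 0.0, b = 0.0, c = 0.0, dx = 0.0, dy = 0.0;
    for (int i = 0; i < n; i++) {
        a  += (double)ix[i] * ix[i];
        b  += (double)ix[i] * iy[i];
        c  += (double)iy[i] * iy[i];
        dx += (double)ix[i] * it[i];
        dy += (double)iy[i] * it[i];
    }

    lk_flow_t result = {0.0f, 0.0f, 0};
    double det = a * c - b * b;
    if (fabs(det) < 1e-9) {
        return result;   /* singular system: no reliable flow estimate */
    }

    /* Multiply the right-hand side by the inverse of the 2x2 matrix. */
    result.u = (float)((-c * dx + b * dy) / det);
    result.v = (float)(( b * dx - a * dy) / det);
    result.valid = 1;
    return result;
}
```

The five running sums are what make the method compute-intensive: they must be accumulated over every window position in the frame, which is the kind of regular, data-parallel work that maps well to hardware.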

The code to implement this algorithm is shown below, where two input frames are read from a file, processed through function fpga_optflow, and the results written to an output file.

#include <stdio.h>
#include "sds_lib.h"	// sds_alloc, sds_free
// pix_t, yuv_t, HEIGHT, WIDTH, FILEINAME, and ONAME come from project headers not shown here

int main()
{
	FILE *f;
	// physically contiguous buffers suitable for DMA transfers
	pix_t *inY1 = (pix_t *)sds_alloc(HEIGHT*WIDTH);
	yuv_t *inCY1 = (yuv_t *)sds_alloc(HEIGHT*WIDTH*2);
	pix_t *inY2 = (pix_t *)sds_alloc(HEIGHT*WIDTH);
	yuv_t *inCY2 = (yuv_t *)sds_alloc(HEIGHT*WIDTH*2);
	yuv_t *outCY = (yuv_t *)sds_alloc(HEIGHT*WIDTH*2);
	printf("allocated buffers\n");

	f = fopen(FILEINAME,"rb");
	if (f == NULL) {
		printf("failed to open file %s\n", FILEINAME);
		return -1;
	}
	printf("opened file %s\n", FILEINAME);

	// read two consecutive frames from the same file
	read_yuv_frame(inY1, WIDTH, WIDTH, HEIGHT, f);
	printf("read 1st %dx%d frame\n", WIDTH, HEIGHT);
	read_yuv_frame(inY2, WIDTH, WIDTH, HEIGHT, f);
	printf("read 2nd %dx%d frame\n", WIDTH, HEIGHT);

	fclose(f);
	printf("closed file %s\n", FILEINAME);

	convert_Y8toCY16(inY1, inCY1, HEIGHT*WIDTH);
	printf("converted 1st frame to 16bit\n");
	convert_Y8toCY16(inY2, inCY2, HEIGHT*WIDTH);
	printf("converted 2nd frame to 16bit\n");

	fpga_optflow(inCY1, inCY2, outCY, HEIGHT, WIDTH, WIDTH, 10.0);
	printf("computed optical flow\n");

	// write optical flow data image to disk
	write_yuv_file(outCY, WIDTH, WIDTH, HEIGHT, ONAME);

	// release the buffers obtained from sds_alloc
	sds_free(inY1);
	sds_free(inCY1);
	sds_free(inY2);
	sds_free(inCY2);
	sds_free(outCY);
	printf("freed buffers\n");

	return 0;
}

This is typical for a top-down design flow using standard C/C++ data types. Function fpga_optflow is shown below and contains the sub-functions readMatRows, computeSum, computeFlow, getOutPix, and writeMatRows.

int fpga_optflow (yuv_t *frame0, yuv_t *frame1, yuv_t *framef, int height, int width, int stride, float clip_flowmag)
{
  int img_pix_count = height*width;

  // The intermediate buffers below are declared outside this function (not shown)
  // and are allocated once, on the first call. malloc requires <stdlib.h>.
  if (f0Stream == NULL) f0Stream = (pix_t *) malloc(sizeof(pix_t) * img_pix_count);
  if (f1Stream == NULL) f1Stream = (pix_t *) malloc(sizeof(pix_t) * img_pix_count);
  if (ffStream == NULL) ffStream = (yuv_t *) malloc(sizeof(yuv_t) * img_pix_count);

  if (ixix == NULL) ixix = (int *) malloc(sizeof(int) * img_pix_count);
  if (ixiy == NULL) ixiy = (int *) malloc(sizeof(int) * img_pix_count);
  if (iyiy == NULL) iyiy = (int *) malloc(sizeof(int) * img_pix_count);
  if (dix == NULL) dix = (int *) malloc(sizeof(int) * img_pix_count);
  if (diy == NULL) diy = (int *) malloc(sizeof(int) * img_pix_count);

  if (fx == NULL) fx = (float *) malloc(sizeof(float) * img_pix_count);
  if (fy == NULL) fy = (float *) malloc(sizeof(float) * img_pix_count);

  // read both input frames into the intermediate buffers
  readMatRows (frame0, f0Stream, height, width, stride);
  readMatRows (frame1, f1Stream, height, width, stride);

  // compute the per-pixel sums, the flow vectors, and the output pixels
  computeSum (f0Stream, f1Stream, ixix, ixiy, iyiy, dix, diy, height, width);
  computeFlow (ixix, ixiy, iyiy, dix, diy, fx, fy, height, width);
  getOutPix (fx, fy, ffStream, height, width, clip_flowmag);

  // write the result back to the output frame
  writeMatRows (ffStream, framef, height, width, stride);

  return 0;
}
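
The bodies of the sub-functions are not shown here. Because readMatRows and writeMatRows take both a width and a stride, one plausible reading is that they move rows between a frame buffer with row padding and a tightly packed buffer. The sketch below illustrates only that general idea; copy_rows_strided and sample_t are hypothetical names, and the real functions may also convert between pixel formats.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: the real pix_t and yuv_t definitions are not shown. */
typedef uint16_t sample_t;

/*
 * Copy height rows of width samples from a source whose rows start
 * every stride samples into a tightly packed destination.
 */
static void copy_rows_strided(const sample_t *src, sample_t *dst,
                              int height, int width, int stride)
{
    assert(src != NULL && dst != NULL);
    assert(height > 0 && width > 0 && stride >= width);

    for (int r = 0; r < height; r++) {
        const sample_t *row = src + (size_t)r * stride;   /* start of source row r */
        for (int c = 0; c < width; c++) {
            dst[(size_t)r * width + c] = row[c];
        }
    }
}
```

In main, the stride argument is passed as WIDTH, so under this reading the source rows would carry no padding and the copy would be purely sequential.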

In this example, all of the functions in fpga_optflow process live video data and would therefore benefit from hardware acceleration, with DMAs transferring the data to and from the processing system (PS). If all five functions are annotated as hardware functions, the system has the topology shown in the following figure.

The system can be compiled into hardware and event tracing used to analyze the performance in detail.

The issue here is that the system takes a very long time to complete: approximately 15 seconds for a single frame. To process HD video, the system should process 60 frames per second, or one frame every 16.7 ms. A few optimization directives can be used to ensure the system meets the target performance.
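
To put the gap in perspective, the arithmetic can be sketched as follows. The helper names frame_budget_ms and required_speedup are hypothetical and serve only to spell out the calculation.

```c
#include <assert.h>

/* Time available per frame, in milliseconds, at a given frame rate. */
static double frame_budget_ms(double frames_per_second)
{
    assert(frames_per_second > 0.0);
    return 1000.0 / frames_per_second;
}

/* Factor by which a measured frame time must shrink to fit the budget. */
static double required_speedup(double measured_ms, double budget_ms)
{
    assert(budget_ms > 0.0);
    return measured_ms / budget_ms;
}
```

With these, frame_budget_ms(60.0) is about 16.7 ms, and required_speedup(15000.0, frame_budget_ms(60.0)) is about 900, so the unoptimized system would need to run roughly 900 times faster to reach the target.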