According to a McKinsey survey, 39% of organisations have already implemented some form of machine learning (ML) in their business. While adoption is still relatively nascent, the prospect of improved efficiencies, customer behaviour prediction, and insightful business intelligence makes ML an appealing technology for the Pro AV and broadcast markets.
Media systems can take advantage of the ML capabilities on Xilinx platforms for AI edge processing. Processing directly at the edge, and without needing a network connection, has tremendous benefits in terms of low-latency performance, and could even be useful in overcoming many concerns around privacy and storage of identification metrics in the cloud. Incorporating these ML capabilities alongside audio and video processing pipelines into Xilinx’s adaptable platforms means organisations can monetise analytics, improve workflow efficiency and enhance usability. Ultimately, these integrated ML features allow companies to accelerate innovation and differentiation.
|Machine Learning Solution||Broadcast Use Cases||Pro AV Use Cases|
|Video Object Detection||Lock to an object and create a bounding box to output a cropped portion of the original video||Pan, tilt, and zoom camera control to focus on the speaker; better quality than crop and zoom|
|Detection of specific objects, such as people, animals, or cars. The area around each identified object is bounded with a box, and the box coordinates are fed into an encoder for ROI encoding.|
|Automation in live broadcast sports|
|Natural Language Processing||Speech-to-text for closed captions or subtitles||Automated note-taking during conferences|
|Script translation or movie regionalisation||Detect stress in voice during kiosk interaction|
|Gender or Age Detection||Serve signage ads based on gender or age|
|Video Quality Analysis||Detect complex sequences and optimise encoding parameters|
||During live production, detect an actor’s mood to determine whether they acted according to the director’s wishes. Look for actors in a video clip with certain moods. During post-production, use ML to slightly tweak the actor’s facial expression based on creative or artistic intent.||Detect the mood of a person using a digital kiosk|
||Swipe with gestures to avoid touching an interactive retail or kiosk screen. Control camera operation in collaboration.|
Streaming and storage costs for large video files and UHD content can quickly add up. Region-of-Interest (ROI) encoding can ease this by reducing the overall bitrate of the content while applying the best video quality (VQ) to the areas the eye is naturally drawn to, particularly faces and people, and lowering the VQ in less important areas such as backgrounds.
ROI encoding can also preserve detail in the most important areas in control room applications. For example, if an incident is monitored on a large video wall, the details need to be accurately discernible during any follow-up investigation, and usable for training so that mistakes can be learned from and action plans improved. This means preserving high VQ around text overlays (e.g. clocks), using static coordinates for ROI encoding, and around faces or people, using dynamic, ML-based coordinates.
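As a rough illustration of how static and ML-derived regions might be combined, the sketch below builds a per-block quantisation parameter (QP) offset map from a list of rectangles. It is a minimal sketch under stated assumptions, not the interface of any Xilinx encoder: the `Roi` type, the `build_qp_map` function, the 16-pixel block size and the offset values are all hypothetical, and a real encoder defines its own ROI format. Negative offsets request finer quantisation (higher VQ); the positive background offset lowers VQ everywhere else.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Roi:
    """Hypothetical ROI rectangle in pixels, with a QP offset (negative = higher quality)."""
    x: int
    y: int
    w: int
    h: int
    qp_offset: int


def build_qp_map(frame_w: int, frame_h: int, rois: Iterable[Roi],
                 block: int = 16, background_offset: int = 4) -> List[List[int]]:
    """Return a rows x cols grid of QP offsets, one entry per block of the frame."""
    cols = -(-frame_w // block)  # ceiling division
    rows = -(-frame_h // block)
    qp_map = [[background_offset] * cols for _ in range(rows)]
    for roi in rois:
        # Clip the rectangle to the frame; skip it if nothing is left.
        x0, y0 = max(roi.x, 0), max(roi.y, 0)
        x1, y1 = min(roi.x + roi.w, frame_w), min(roi.y + roi.h, frame_h)
        if x0 >= x1 or y0 >= y1:
            continue
        # Mark every block the rectangle touches. Where regions overlap,
        # keep the lowest offset so the highest-quality request wins.
        for r in range(y0 // block, (y1 - 1) // block + 1):
            for c in range(x0 // block, (x1 - 1) // block + 1):
                qp_map[r][c] = min(qp_map[r][c], roi.qp_offset)
    return qp_map


if __name__ == "__main__":
    # Static ROI: a clock overlay whose position never changes (assumed values).
    clock = Roi(x=3600, y=40, w=200, h=80, qp_offset=-6)
    # Dynamic ROI: a face box as it might arrive from an ML detector (assumed values).
    face = Roi(x=1700, y=600, w=300, h=360, qp_offset=-4)
    qp_map = build_qp_map(3840, 2160, [clock, face])
    print(len(qp_map), "rows x", len(qp_map[0]), "cols")
```

In this arrangement the static rectangles would be configured once, while the dynamic ones would be refreshed whenever the detection model reports new box coordinates.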
Speech recognition using natural language processing (NLP) is already familiar in the home, where Alexa, Google and other smart devices respond to commands, present information and media, or control aspects of the house. With NLP built into devices, the same capabilities can be applied in professional media, making equipment set-up quicker and less complicated without requiring a cloud connection or any related subscription service. With Edge AI, it’s now possible to automatically transcribe meeting notes using speech-to-text algorithms and summarisation models. It’s also possible to perform regional translation, with the potential for near-real-time subtitles in any language, which could be applied to video conferencing applications or to more traditional closed-caption systems in broadcast and cinema.
Targeted advertising is the holy grail for marketers. Using various ML models to analyse the audience in front of a digital sign, it’s possible to serve more relevant, targeted ads based on metrics such as age and gender. This makes the signage provider more attractive to advertisers, who will be willing to pay more for better ad presentation. It also generates valuable data for the advertiser, such as viewer interest, which can lead to improved usage of the service and provides monetisable feedback to the manufacturers they represent. The viewer, in turn, is presented with more relevant and personalised ads, improving their overall shopping experience. Alternative ML models can be used in interactive kiosks, replacing touch screens with more hygienic gesture control for moving to the next ad or, in particular, for placing orders.
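The selection step described above could be sketched as follows. This is only an illustrative outline, assuming an upstream model has already produced an age estimate and a gender label for each person in front of the sign; the `Viewer` type, the age bands, the `pick_ad` function and the catalogue keys are all hypothetical and do not correspond to any particular product. The sketch picks the ad for the most common audience segment and falls back to a default when nobody is detected or no ad matches.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Viewer:
    """Hypothetical per-person output of an age/gender model."""
    age: int
    gender: str


def age_band(age: int) -> str:
    """Map an age estimate to a coarse band (band boundaries are assumptions)."""
    if age < 18:
        return "under18"
    if age < 35:
        return "18-34"
    if age < 55:
        return "35-54"
    return "55+"


def pick_ad(viewers: Iterable[Viewer],
            catalogue: Dict[Tuple[str, str], str],
            default_ad: str) -> str:
    """Return the ad for the largest audience segment that has a catalogue entry."""
    counts = Counter((age_band(v.age), v.gender) for v in viewers)
    # most_common() lists segments from largest to smallest.
    for segment, _count in counts.most_common():
        if segment in catalogue:
            return catalogue[segment]
    return default_ad
```

Only aggregate segment counts are used to choose the ad, which fits the edge-processing approach of keeping per-person data on the device.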
Imagine live-streaming a panel discussion about an artist’s work at a local college. This is a low-budget event with a niche audience, so production costs need to be kept very low. A single camera will typically be in use, capturing the whole panel with occasional zooming and panning. Using ML face-tracking, it’s possible to have a static 4K camera capture the whole panel while automatically creating additional lower-resolution HD windowed outputs around each of the panellists and tracking them through the conversation. So, with three panellists, a single 4K camera can provide four different output shots to switch between during the live stream: the wide angle and three close-ups. This creates more visual interest and requires no extra camera equipment to set up; the camera operator can become the video mixer and simply select which frames to stream.
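The windowing step can be sketched in a few lines. The example below is a minimal sketch, assuming a face detector (not shown) returns a bounding box as `(x, y, width, height)` in pixels of the 4K frame; the function names `crop_window` and `smooth_window` and the smoothing factor are hypothetical. It centres a 1920x1080 window on the face, clamps it so it never leaves the 3840x2160 frame, and optionally smooths the window position between frames to reduce jitter.

```python
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height) in pixels


def crop_window(face_box: Box,
                frame_size: Tuple[int, int] = (3840, 2160),
                out_size: Tuple[int, int] = (1920, 1080)) -> Box:
    """Return an output-sized window centred on the face and kept inside the frame."""
    fx, fy, fw, fh = face_box
    frame_w, frame_h = frame_size
    out_w, out_h = out_size
    # Centre of the detected face.
    cx = fx + fw / 2
    cy = fy + fh / 2
    # Top-left corner of a window centred on that point.
    left = int(round(cx - out_w / 2))
    top = int(round(cy - out_h / 2))
    # Clamp so the window stays within the source frame.
    left = max(0, min(left, frame_w - out_w))
    top = max(0, min(top, frame_h - out_h))
    return left, top, out_w, out_h


def smooth_window(prev: Optional[Box], new: Box, alpha: float = 0.2) -> Box:
    """Move the window only part of the way towards its new position each frame."""
    if prev is None:
        return new
    left = int(round(prev[0] + alpha * (new[0] - prev[0])))
    top = int(round(prev[1] + alpha * (new[1] - prev[1])))
    return left, top, new[2], new[3]
```

Running `crop_window` once per tracked panellist yields one HD window each, which together with the full wide-angle frame gives the set of shots to switch between.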
This approach can be applied, with various ML tracking models, in professional broadcast applications such as sports coverage or in collaboration environments where multiple video conferencing attendees can be tracked automatically.
Available from Xilinx partner MakarenaLabs, MuseBox is a real-time machine learning system designed for Pro AV and broadcasting applications. It can work with live streams, for interactive or live applications, and with local files, for example when there is a large number of files to process or when the files cannot be accessed from outside the local network for legal reasons. It is based on a Zynq UltraScale+ MPSoC using multimedia and ML stacks, or on Xilinx Alveo accelerator cards for on-premises processing. MuseBox supports facial and people analysis, object detection, audio analysis and more.
MakarenaLabs are highly experienced in machine learning and offer a range of libraries and products for various AV use cases. The Mooseka system performs audio analysis, recognition and feature extraction, and is used in their MRadio stream analyser to recognise music content for copyright enforcement and protection, radio promotion and marketing analysis.