WeChat Vision Team Wins Both Tracks of the CVPR Video Similarity Challenge, and WeChat Channels (視頻號) Uses the Same Techniques

2023-06-24 06:38:48 Source: 機器之心 (Jiqizhixin)


Jiqizhixin Column

Jiqizhixin Editorial Department

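Video content understanding plays an important role in content moderation, product operations, and search and recommendation. Within it, Video Similarity is one of the most fundamental and important techniques in video understanding. It is used to combat the reposting of short videos, to combat unauthorized recording and rebroadcasting of live streams, and to search blocklist libraries, all of which are critical to a healthy video content ecosystem. The WeChat Vision Team entered the CVPR 2023 Video Similarity Challenge, a competition organized by Meta AI to advance the field of video copy detection. The team won both tracks with scores far ahead of the other teams, and the underlying techniques have also been deployed in WeChat Channels.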

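Task Background

Video Copy Detection aims to determine whether one video copies another. It covers full copies, clipped segments, and adversarial edits such as filters, special effects, decorative borders, and subtitles. The technique originated in video copyright protection. With the rise of short-video platforms, video creation has surged: more than a hundred million new videos are created and shared online every day, accompanied by extremely severe copying. Combating copying and encouraging original work is critical to the content ecosystem of a short-video platform. Because large economic interests are involved, black-market and gray-market operators use all kinds of editing techniques to evade detection, which poses a major technical challenge.

Below are some real examples of video copying. The left and right sides show different copied versions of the same video.

Figure 1: Real examples of copied videos from WeChat Channels, with adversarial edits such as segment clipping, cropping, and added black borders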

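Competition Overview

The Video Similarity Challenge is a competition held by Meta AI at a CVPR 2023 workshop, with USD 100,000 in prize money, that aims to advance the field of video copy detection. It has two tracks, the Descriptor Track and the Matching Track. The goal of the Descriptor Track is to generate video embeddings from which a similarity score between two videos is computed; with a vector index, these embeddings allow similar videos to be recalled quickly. The Matching Track performs precise matching on the recalled results and further localizes the copied segments. The two tracks correspond to two stages of Video Copy Detection, and each stage has a significant impact on the final detection quality.

Figure 2: The relationship between the Descriptor Track and the Matching Track in Video Copy Detection. The Descriptor Track generates video embeddings and recalls the copied videos from the reference videos; the Matching Track then localizes the copied segments.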

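Data

The dataset mainly contains two kinds of videos, query and reference. A reference is typically a video published normally by a user. A query that has a copy relationship with it is a new video produced by taking segments of the reference and reposting them after applying various edits. The table below shows the distribution statistics of the competition dataset. Phase 1 and Phase 2 are two independent closed test phases that share the same reference set.

Generally speaking, if a copy relationship exists, the query and the reference are highly similar semantically in some segments. However, not all similar videos are copies: as the figure below shows, a query and a reference can be similar videos without any copy relationship at the level of video semantics. Deciding whether a query copies a reference therefore requires analyzing and comparing semantics at the level of the whole video, which is one of the difficulties of this challenge.

Figure 3: A copied-video example. The left side is the reference video; the right side is a query video that copies a segment of the reference

Figure 4: The left side is the reference video; the right side is a normal query video. The two are similar but have no copy relationship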

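Evaluation Method

In the Descriptor Track, the model must produce, for each query and each reference, a set of embeddings sampled at no more than 1 fps. The predicted confidence that a query-reference pair is a copy is the maximum pairwise inner product between the two embedding sets. The confidence scores of all query-reference pairs are sorted in descending order, a single global confidence threshold controls how many pairs are recalled, and the micro-average precision is computed against the ground truth.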
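
The scoring rule above can be sketched in NumPy as follows. This is a minimal illustration, not the official evaluation code: the function names `pair_score` and `micro_average_precision` are hypothetical, embeddings are assumed to be stored as one row per sampled frame, and the metric is assumed to be average precision over a single global ranking of all candidate pairs.

```python
import numpy as np


def pair_score(query_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Copy confidence of one pair: the largest inner product between
    any query frame embedding and any reference frame embedding."""
    return float((query_emb @ ref_emb.T).max())


def micro_average_precision(scores, labels, num_positives):
    """Average precision over one global, descending ranking of pairs.

    scores: confidence of each candidate (query, reference) pair
    labels: 1 if the pair is a true copy in the ground truth, else 0
    num_positives: total number of true copy pairs in the ground truth
    """
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = np.asarray(labels, dtype=float)[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Precision is accumulated only at the ranks where a true pair appears.
    return float((precision_at_k * hits).sum() / num_positives)
```

Because the ranking is global, a single overconfident false pair lowers the precision at every rank below it, which is why the later modules focus on suppressing high-confidence false recalls.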

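In the Matching Track, the model must not only identify the query-reference pairs that have a copy relationship but also localize the copied segment in both the query and the reference and give a confidence for it. The figure below shows how precision and recall are computed for a single segment: the more the predicted segment location overlaps with the ground truth, the higher the corresponding precision and recall. All segments are sorted by confidence in descending order, and the micro-average precision is computed against the ground truth.

Figure 5: How precision and recall are computed for a single segment in the Matching Track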

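Related Work

Descriptor Track

The Descriptor Track relies mainly on embeddings for recall, and contrastive learning, thanks to its efficient learning scheme, has gradually become the mainstream way to train embeddings. The WeChat Vision Team's Descriptor Track solution is also based on contrastive learning, and the team briefly surveys several classic works here. SimCLR [20] uses a more diverse set of augmentations and their combinations, including random cropping, scaling, flipping, color distortion, and Gaussian blur, and treats the other samples in the same batch as negatives; the framework is simple and effective but depends heavily on batch size. MoCo [22] builds a queue of negative samples to enlarge the number and range of sampled negatives and updates the queue through a momentum encoder, which removes the dependence on batch size. BYOL [21] uses an asymmetric architecture that needs no negative samples; it trains two networks (an online network and a target network) by bootstrapping, which avoids model collapse. SwAV [18] introduces clustering, so pairwise comparisons are no longer needed; instead it compares cluster assignments across different views. DINO [19] dynamically updates a teacher-student pair of networks, distills from the teacher into the student, and smooths the update with a momentum mechanism, which improves stability and avoids collapse.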
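
The momentum mechanism mentioned for MoCo (and, in a similar form, for BYOL and DINO) can be illustrated with a short sketch. Everything here is an assumption made for illustration: parameters are represented as a plain dict of NumPy arrays, and the names `momentum_update` and `NegativeQueue` are hypothetical rather than taken from any of the cited codebases.

```python
from collections import deque

import numpy as np


def momentum_update(key_params, query_params, m=0.999):
    """Exponential moving average: each key-encoder parameter moves a
    small step (1 - m) toward the corresponding query-encoder parameter."""
    return {name: m * key_params[name] + (1.0 - m) * query_params[name]
            for name in key_params}


class NegativeQueue:
    """Fixed-size FIFO of past key embeddings that serve as negatives."""

    def __init__(self, max_size):
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._keys = deque(maxlen=max_size)

    def enqueue(self, batch_keys):
        for key in batch_keys:
            self._keys.append(np.asarray(key, dtype=float))

    def as_matrix(self):
        """Stack the stored keys into a (num_keys, dim) matrix."""
        return np.stack(list(self._keys))
```

The queue decouples the number of negatives from the batch size, and the slow-moving key encoder keeps the embeddings stored in the queue roughly consistent with one another.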

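Matching Track

Localizing copied segments is usually based on frame-level features, so traditional methods build a frame-to-frame similarity matrix and localize contiguous segments on it. Early work includes Temporal Hough Voting [15], the graph-based Temporal Network [16], and Dynamic Programming [17]. Later, SPD [13] brought object detection to the task, turning it into the detection of copied regions on the similarity matrix. More recently, TransVCL [14] introduced a Transformer architecture to further learn frame-level feature interactions between and within videos, achieving state-of-the-art results. In the competition, the WeChat Vision Team reproduced Temporal Network and TransVCL and proposed a new solution of its own, which far outperforms these academic SOTA methods on the competition dataset.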
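
To make the frame-to-frame similarity matrix concrete, the sketch below builds one and applies a simplified form of temporal Hough voting, in which every sufficiently similar frame pair votes for a time offset. It is a toy version under stated assumptions: the function names and the `threshold` parameter are hypothetical, and the published method [15] includes spatial post-filtering that is not shown here.

```python
import numpy as np


def similarity_matrix(query_emb, ref_emb):
    """Entry (i, j) is the inner product of query frame i and reference frame j."""
    return query_emb @ ref_emb.T


def hough_vote_offset(sim, threshold=0.5):
    """Vote for the temporal offset (ref_index - query_index) shared by
    the most strongly matching frame pairs.

    Returns (best_offset, accumulated_score), or (None, 0.0) if no pair
    reaches the threshold.
    """
    votes = {}
    q_idx, r_idx = np.nonzero(sim >= threshold)
    for q, r in zip(q_idx, r_idx):
        offset = int(r - q)
        votes[offset] = votes.get(offset, 0.0) + float(sim[q, r])
    if not votes:
        return None, 0.0
    best = max(votes, key=votes.get)
    return best, votes[best]
```

A copied segment played at the original speed shows up as a diagonal streak in the matrix, so all of its frame pairs vote for the same offset. Speed changes bend or tilt the streak, which is one reason later methods moved to learned detectors on the matrix.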

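Descriptor Track Solution

Problem Analysis

The core goal of the Descriptor Track is to recall potential copy video pairs based on embeddings, and in the academic literature contrastive learning is an effective way to train embeddings. The question the WeChat Vision Team set out to explore was therefore how to train an effective embedding for this setting, given the characteristics and difficulties of the dataset. The team first analyzed the data in detail and identified several common types of samples in the dataset: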

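Unaugmented videos. These are closer to the original videos that users publish. Statistics show that queries of this type rarely copy a reference, but they very easily cause similar videos to be recalled by mistake.

Randomly augmented videos. To make the dataset more complex, the organizers applied random augmentations of varying strength to both queries and references, including basic ones such as GaussNoise, GaussBlur, Crop, Pad, Rotation, ColorJitter, and Compression, as well as complex ones such as OverlayEmoji, OverlayText, and OverlayVideo.

Multi-scene videos. Another kind of hard sample stacks multiple scenes within a single video frame. The scenes in one frame then differ greatly, and each scene may also undergo its own augmentations, which makes such samples hard to handle with conventional approaches.

Figure 6: The 3 types of samples among query videos: (a) unaugmented video; (b) augmented video; (c) multi-scene video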

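Solution

After analyzing the data, the WeChat Vision Team identified the main difficulties of the task and, to address them, proposed a two-stage detection method for identifying copied videos. Figure 7 shows the overall framework of the solution, which consists of three modules: Frame-Level Embedding, Video Editing Detection, and Frame Scenes Detection.

Figure 7: Inference pipeline of the WeChat Vision Team's solution. (a) The query video passes through the Video Editing Detection module, which yields high-confidence queries; (b) each query frame is analyzed by Frame Scenes Detection and split into sub-images where multiple scenes are present; (c) each video frame is passed through the baseline model to extract an embedding, forming the set of frame embeddings for the query.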

1. Frame-Level Embedding

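Model framework: To stay compatible with the Matching Track's need for frame-level features, the team trains its representation model at the frame level, using self-supervised training based mainly on a contrastive learning framework. For each sampled video frame, the team applies different transformations drawn from the augmentations listed above to obtain two images that serve as a positive pair, while the other images serve as negatives. To compare different backbones and to make later model ensembling easier, the team trained CNN-based, ViT-based, and Swin Transformer-based models as contrastive learning baselines. For the final embedding ensemble, 4 sets of embeddings are produced for each frame, concatenated, and reduced with PCA to the dimension required by the organizers.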
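
The ensemble step (concatenate the per-model embeddings, then reduce with PCA) might look roughly like the sketch below. It assumes PCA is fitted with a plain SVD on a held-out feature matrix and that the reduced descriptors are L2-normalized afterwards; the normalization, the function names `fit_pca` and `ensemble_embed`, and the choice of fitting data are illustrative assumptions, not details confirmed by the article.

```python
import numpy as np


def fit_pca(features, out_dim):
    """Fit PCA on an (n_samples, n_dims) matrix.

    Returns the feature mean and an (n_dims, out_dim) projection whose
    columns are the top principal directions.
    """
    mean = features.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:out_dim].T


def ensemble_embed(per_model_embs, mean, components):
    """Concatenate per-model frame embeddings and project them with PCA."""
    stacked = np.concatenate(per_model_embs, axis=1)
    reduced = (stacked - mean) @ components
    # L2-normalize so inner products behave like cosine similarity (assumption).
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    return reduced / np.maximum(norms, 1e-12)
```

In practice the projection would be fitted once on training frames and then applied unchanged to every query and reference frame, so that all descriptors live in the same reduced space.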

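Loss function: In addition to the commonly used InfoNCE Loss, the team follows SSCD [1] and introduces the Differential Entropy Loss [3], which can be understood intuitively as pushing each sample away from its nearest negative within the same batch in feature space.

In the formula, N is the number of samples in the batch, z is an image feature, and [symbol missing in the source] denotes the samples other than i.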
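
The two losses can be sketched in NumPy as follows. The entropy term is written here as -(1/N) Σ_i log min_{j≠i} ||z_i - z_j||, which is the form used in SSCD [1] and is assumed to be what the description above refers to. The temperature value, the `eps` constant, and the function names are illustrative assumptions, and the inputs are assumed to be L2-normalized with one row per image.

```python
import numpy as np


def info_nce(z_a, z_b, temperature=0.05):
    """InfoNCE with in-batch negatives.

    z_a[i] and z_b[i] are embeddings of two augmented views of image i;
    every z_b[j] with j != i acts as a negative for z_a[i].
    """
    logits = (z_a @ z_b.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_prob)))


def differential_entropy_loss(z, eps=1e-8):
    """Entropy regularizer: -(1/N) * sum_i log(min_{j != i} ||z_i - z_j||).

    Minimizing it increases each sample's distance to its nearest
    neighbour, spreading the embeddings out in feature space.
    """
    dists = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude the sample itself
    return float(-np.mean(np.log(dists.min(axis=1) + eps)))
```

A combined objective would typically be a weighted sum of the two terms; the weight is a hyperparameter that the article does not specify.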

2. Video Editing Detection

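The team's statistics show that unaugmented videos are usually not copies, yet they produce false recalls, and the better the image representation model is trained, the higher the confidence of these false recalls. The problem is therefore hard to handle at the level of single-frame semantic representation. Instead, the team uses a video-level classification model to make an initial judgment on whether a query contains any augmentation. If it does not, a random vector with a very small norm is used as the query's representation, so that during recall its copy confidence against any reference is very small and no high-confidence false recall is produced.

The Video Editing Detection model consists of two parts, CLIP [2] and Roberta [4,6]. The team uses CLIP ViT-L/14 to extract video frame features and then feeds the feature sequence into the Roberta model for binary classification. On the competition dataset, both the Accuracy and the AP of this model reach above [value missing in the source].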
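
The fallback for unedited queries can be illustrated with a few lines of NumPy. This is a sketch under assumptions: `edit_probability` stands in for the output of the video-level classifier, and the `threshold`, `tiny_norm`, and `seed` parameters as well as the function name are hypothetical.

```python
import numpy as np


def query_descriptors(frame_embs, edit_probability, threshold=0.5,
                      tiny_norm=1e-6, seed=0):
    """Return the descriptors to submit for one query video.

    If the classifier believes the video was edited, keep its frame
    embeddings. Otherwise return a single random vector with a very
    small norm, so that its inner product with any unit-norm reference
    embedding is at most tiny_norm.
    """
    if edit_probability >= threshold:
        return frame_embs
    rng = np.random.default_rng(seed)
    vec = rng.normal(size=(1, frame_embs.shape[1]))
    return tiny_norm * vec / np.linalg.norm(vec)
```

Since the Descriptor Track scores a pair by its maximum inner product, such a query lands at the very bottom of the global ranking instead of producing a confident false match.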

3. Frame Scenes Detection

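In this task, multi-scene videos are a typical hard case. The team found that the scenes are usually stitched together horizontally or vertically, so a traditional edge detection method is enough to detect whether a frame contains distinct scene regions and to split it into sub-images. The team also extracts features from each of the resulting sub-images and uses them as representations of that video frame.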
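
As a rough illustration of how a straight stitching seam can be found with simple edge cues, the sketch below looks for a column (or row) where neighbouring pixels differ strongly along most of its length. It is a deliberately simplified stand-in, not the team's detector: it handles only a single seam in a grayscale frame, and the function names and the `min_strength` and `margin` parameters are assumptions.

```python
import numpy as np


def find_vertical_seam(gray, min_strength=40.0, margin=0.2):
    """Return the column index where a left/right seam starts, or None.

    gray is a 2-D grayscale frame. The seam must lie away from the
    frame border (controlled by margin).
    """
    _, w = gray.shape
    diff = np.abs(np.diff(gray.astype(float), axis=1))  # shape (h, w - 1)
    # The median over rows is large only if the edge spans most of the height.
    col_strength = np.median(diff, axis=0)
    lo, hi = int(w * margin), int(w * (1 - margin))
    if hi <= lo:
        return None
    best = lo + int(np.argmax(col_strength[lo:hi]))
    return best + 1 if col_strength[best] >= min_strength else None


def split_scenes(gray):
    """Split a frame into two sub-images along a detected seam, if any."""
    seam = find_vertical_seam(gray)
    if seam is not None:
        return [gray[:, :seam], gray[:, seam:]]
    seam = find_vertical_seam(gray.T)  # the same test applied to rows
    if seam is not None:
        return [gray[:seam, :], gray[seam:, :]]
    return [gray]
```

Each returned sub-image would then go through the frame embedding model separately, so that a copied scene occupying only half of the frame still yields a clean descriptor.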

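Matching Track Solution

Figure 8: The Matching Track solution. (a) The Feature Extraction preprocessing module extracts a frame-level feature matrix; (b) the Similar Segment Matching module predicts potential copy paths from the similarity matrix; (c) the Similar Segment Parsing module parses out the concrete copy segments.

Solution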

1. Feature Extraction

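The team's Matching Track solution is built on top of its Descriptor Track solution, so it reuses the Frame Scenes Detection and Frame-Level Embedding modules from the Descriptor Track as preprocessing to extract features. Because the Matching Track solution has a finer-grained post-processing module, it does not include the Video Editing Detection module.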

2. Similar Segment Matching

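The solution localizes copy segments from the similarity matrix between the query video and the reference video. The query and reference videos are uniformly truncated or padded so that the similarity matrix is 128 by 128, and the high-resolution network HRNet-w18 [8] serves as the backbone that processes the similarity matrix as an image. The training target is a heatmap generated from the ground truth, so that the output accurately reflects the matching relationship. Figure 9 shows some real input and output examples; the 3 copy segments on the left stand out clearly after the model has processed them.

Figure 9: Examples processed by the Matching Track model. The first row shows the original input similarity matrices; the second row shows the matching maps output by HRNet. The 3 examples on the left contain copy segments, and the 2 on the right do not.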
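
Bringing every query-reference pair to a fixed 128 by 128 input can be sketched as follows. Padding with zeros and truncating to the first 128 frames are assumptions made for illustration, and `fixed_size_similarity` is a hypothetical name; the article states only that videos are truncated or padded to this size.

```python
import numpy as np


def fixed_size_similarity(query_emb, ref_emb, size=128):
    """Build a (size, size) similarity matrix for one video pair.

    Videos longer than `size` frames are truncated; shorter ones leave
    the remaining rows or columns zero-padded.
    """
    sim = query_emb[:size] @ ref_emb[:size].T
    canvas = np.zeros((size, size), dtype=np.float32)
    canvas[: sim.shape[0], : sim.shape[1]] = sim
    return canvas
```

A fixed input size lets the similarity matrix be treated like a single-channel image, which is what allows an image backbone such as HRNet to be applied to it directly.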

3. Similar Segment Parsing

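Once an accurate matching map is available, the concrete positions of the copy segments have to be parsed from it. The team made two design choices here: (1) a classification model filters out wrong matching results, for example the 2 examples on the right of Figure 9, which a simple classifier can reject; (2) a connected-components algorithm and RANSAC regression [9] identify the positions of the copy segments in the matching map.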
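
The second design choice can be sketched as follows: threshold the matching map, group the surviving cells into connected components, and fit a line to each component with RANSAC to read off the segment boundaries. This is a minimal sketch, not the team's implementation; the threshold, tolerance, iteration count, and all function names are assumptions, and rows are taken to index query frames while columns index reference frames.

```python
import random
from collections import deque

import numpy as np


def connected_components(mask):
    """8-connected components of a boolean matrix, as lists of (row, col)."""
    seen = np.zeros(mask.shape, dtype=bool)
    steps = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]
    components = []
    for r, c in zip(*np.nonzero(mask)):
        if seen[r, c]:
            continue
        seen[r, c] = True
        queue, comp = deque([(int(r), int(c))]), []
        while queue:  # breadth-first flood fill
            y, x = queue.popleft()
            comp.append((y, x))
            for dy, dx in steps:
                ny, nx = y + dy, x + dx
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not seen[ny, nx]):
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        components.append(comp)
    return components


def ransac_line(points, iters=100, tol=1.5, seed=0):
    """Inliers of the best line ref = slope * query + intercept."""
    rng = random.Random(seed)
    best = []
    for _ in range(iters):
        (q1, r1), (q2, r2) = rng.sample(points, 2)
        if q1 == q2:
            continue  # a vertical pair cannot define the slope
        slope = (r2 - r1) / (q2 - q1)
        intercept = r1 - slope * q1
        inliers = [(q, r) for q, r in points
                   if abs(slope * q + intercept - r) <= tol]
        if len(inliers) > len(best):
            best = inliers
    return best


def parse_segments(heatmap, threshold=0.5, min_points=5):
    """Return (query_start, query_end, ref_start, ref_end) per component."""
    segments = []
    for comp in connected_components(heatmap >= threshold):
        if len(comp) < min_points:
            continue  # too small to be a copy path
        inliers = ransac_line(comp)
        if len(inliers) < min_points:
            continue
        qs = [q for q, _ in inliers]
        rs = [r for _, r in inliers]
        segments.append((min(qs), max(qs), min(rs), max(rs)))
    return segments
```

RANSAC is a natural fit here because a component can include stray cells around the true copy path; fitting on random point pairs and keeping the largest consensus set ignores those outliers.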

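Competition Results

The WeChat Vision Team ultimately won both tracks of the Video Similarity Challenge with scores far ahead of the other teams, confirming the effectiveness of its solution.

Descriptor Track final leaderboard

Matching Track final leaderboard

In the Descriptor Track, the WeChat Vision Team achieved a uAP of [value missing in the source], a clear improvement over the second-place team's [value missing in the source]. In the Matching Track, the team's solution performed remarkably well, reaching a uAP of [value missing in the source], far ahead of the other participants. The work described in this article has been published in [10][11].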

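Summary and Outlook

By winning both tracks of the Video Similarity Challenge, the WeChat Vision Team showed that its video similarity retrieval and copy detection technology is at the forefront of the industry. These techniques have already been deployed in the WeChat Channels product. The team will continue to optimize them to combat black-market and gray-market activity and to keep WeChat's content ecosystem healthy.

References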

[1] Ed Pizzi, Sreya Dutta Roy, Sugosh Nagavara Ravindra, Priya Goyal, and Matthijs Douze. A self-supervised descriptor for image copy detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14532–14542, 2022.

[2] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

[3] Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, and Hervé Jégou. Spreading vectors for similarity search. arXiv preprint, 2018.

[4] Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. Revisiting pre-trained models for Chinese natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 657–668, Online, Nov. 2020. Association for Computational Linguistics.

[5] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.

[6] Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. Pre-training with whole word masking for Chinese BERT. arXiv preprint, 2019.

[7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint, 2020.

[8] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, 2019.

[9] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

[10] Tianyi Wang, Feipeng Ma, Zhenhua Liu, Fengyun Rao. A Dual-level Detection Method for Video Copy Detection. arXiv preprint, 2023.

[11] Zhenhua Liu, Feipeng Ma, Tianyi Wang, Fengyun Rao. A Similarity Alignment Model for Video Copy Segment Matching. arXiv preprint, 2023.

[12] Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Patras, and Ioannis Kompatsiaris. ViSiL: Fine-grained spatio-temporal video similarity learning. In IEEE International Conference on Computer Vision (ICCV), 2019.

[13] Chen Jiang, Kaiming Huang, Sifeng He, et al. Learning segment similarity and alignment in large-scale content based video retrieval. In Proceedings of the 29th ACM International Conference on Multimedia. 2021.

[14] Sifeng He, Yue He, Minlong Lu, Chen Jiang, et al. TransVCL: Attention-enhanced Video Copy Localization Network with Flexible Supervision. arXiv preprint.

[15] Douze, Matthijs, Hervé Jégou, and Cordelia Schmid. An image-based approach to video copy detection with spatio-temporal post-filtering. IEEE Transactions on Multimedia, 2010.

[16] Tan, Hung-Khoon, et al. Scalable detection of partial near-duplicate videos by visual-temporal consistency. In Proceedings of the 17th ACM international conference on Multimedia. 2009.

[17] Chou, Chien-Li, Hua-Tsung Chen, and Suh-Yin Lee. Pattern-based near-duplicate video retrieval and localization on web-scale videos. IEEE Transactions on Multimedia, 2015.

[18] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems, 33:9912–9924, 2020.

[19] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.

[20] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.

[21] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.

[22] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
