Figure skating scoring is challenging because it requires judging both the technical moves of the skaters and their coordination with the background music. Most learning-based methods handle it poorly for two reasons: 1) each move in figure skating changes quickly, so simply applying traditional frame sampling loses a lot of valuable information, especially in videos that are 3 to 5 minutes long; 2) prior methods rarely considered the critical audio-visual relationship in their models. For these reasons, we introduce a novel architecture named Skating-Mixer. It extends the MLP framework in a multimodal fashion and effectively learns long-term representations through our designed memory recurrent unit (MRU). Aside from the model, we collected a high-quality audio-visual dataset, FS1000, which contains over 1000 videos covering 8 types of programs with 7 different rating metrics, surpassing other datasets in both quantity and diversity. Experiments show that the proposed method achieves state-of-the-art results on all major metrics on the public Fis-V dataset and on our FS1000 dataset. In addition, we include an analysis applying our method to recent competitions at the Beijing 2022 Winter Olympic Games, which demonstrates its strong applicability.
|Original language||English (US)|
|Title of host publication||Proceedings of the AAAI Conference on Artificial Intelligence|
|Publisher||Association for the Advancement of Artificial Intelligence (AAAI)|
|Number of pages||9|
|State||Published - Jun 26 2023|
Bibliographical note
KAUST Repository Item: Exported on 2023-08-29
Acknowledgements: This work was supported by the National Key R&D Program of China (Grant No. 2022YFF1202903) and the National Natural Science Foundation of China (Grant Nos. 61972188 and 62122035).