This project implements a pipeline for video feature extraction using a pretrained Swin3D-B model and evaluates performance under cross-subject conditions.
The goal is to extract meaningful video representations and assess generalization to unseen subjects using GroupKFold validation.
Feature extractor: pretrained Swin3D-B (video Swin Transformer), used to produce per-video embeddings.
Classifier: trained on the extracted embeddings to predict the video labels.
```
transfer_files/
├── *.mp4
├── labels_lesson_CV.txt
└── features/          # generated embeddings
```
Labels file format (CSV, one row per video):

```
video_name,label_id
```
The subject ID is extracted from the video filename: it is the prefix before the first separator character.
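A minimal sketch of the extraction, assuming an underscore separator and a filename pattern like `s01_lesson_cv.mp4` (both the separator and the example name are assumptions, not taken from the project data):

```python
def subject_id(video_name: str) -> str:
    """Return the subject prefix of a video filename (assumes '_' separator)."""
    return video_name.split("_")[0]

print(subject_id("s01_lesson_cv.mp4"))  # s01
```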
Create environment:
```
python -m venv venv
source venv/bin/activate
```
Install dependencies:
```
pip install torch torchvision pandas scikit-learn av
```
PyAV is required for video decoding.
Run the full pipeline:
```
python lab3.py
```
Outputs: feature embeddings saved to features/ and per-fold validation accuracies.
Validation method: GroupKFold (5 folds)
Groups = subject IDs
This ensures that no subject appears in both the train and test split of the same fold, so accuracy reflects generalization to unseen subjects.
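The group-aware splitting can be illustrated on toy data (the sample counts and subject IDs below are made up, not the project's data):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Ten toy samples from five subjects (two clips per subject).
X = np.zeros((10, 3))
groups = ["s0", "s0", "s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"]

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, groups=groups):
    train_subjects = {groups[i] for i in train_idx}
    test_subjects = {groups[i] for i in test_idx}
    # No subject ever appears on both sides of a split.
    assert train_subjects.isdisjoint(test_subjects)
```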
Fold 0: 0.000
Fold 1: 0.000
Fold 2: 0.375
Fold 3: 0.286
Fold 4: 0.143
Mean accuracy ≈ 0.16
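The reported mean is simply the arithmetic mean of the five fold accuracies:

```python
fold_accuracies = [0.000, 0.000, 0.375, 0.286, 0.143]
mean_acc = sum(fold_accuracies) / len(fold_accuracies)
print(round(mean_acc, 2))  # 0.16
```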
Low performance is expected due to the small per-fold test sets and the strict cross-subject split, which requires the classifier to generalize to entirely unseen subjects.
Author: Zeev Weizmann, MSc Data Science & AI, Université Côte d'Azur

Page: https://zeevweizmann.github.io/Feature-Extraction-with-Video-Swin-Transformer/
Code: https://github.com/ZeevWeizmann/Feature-Extraction-with-Video-Swin-Transformer