This is a post to cover the basics of the YouTube-8M challenge. Kaggle hosted the 3rd YouTube-8M Video Understanding Challenge. Many companies have an interest in things like labeling videos with meta-information, such as the events happening in a video (for example, the detected action "blowing out candles"). The obvious reason a company has an interest in this is to determine video contents that are not described by the uploader.
The overall goal of the YT8M competition is to temporally localize topics of interest. While several YouTube-based datasets exist, I will be using the 237k human-verified 5-second clips (here). This dataset contains 1,000 classes (which were selected to be temporally localizable). Each class is a knowledge graph entity (analogous to a WordNet ID), and each segment is broken into 5 frame-like data points. Briefly, each frame-like data point contains audio-visual features for 1 second of the clip.

For the visual features, the videos are subsampled to 1 frame/second and fed into an Inception-V3 network trained on ImageNet. This created a 2048-dimensional feature vector per second of video. This was then followed by PCA, and then quantization (1 byte per coefficient).

The audio features were extracted using a CNN model over dense spectrograms following Hershey et al. Spectrums were calculated at 10 ms intervals and converted to log-transformed, 64-bin mel-spaced frequency bins. These spectrograms are then fed into a CNN, and the embeddings are extracted and quantized into 128 8-bit features.
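Both feature streams are stored as quantized bytes, so they have to be mapped back to floats before they can be fed to a model. The sketch below shows one way that dequantization and per-segment assembly might look; the [-2, 2] quantization range, the half-step bias, and the 1024-dimensional post-PCA visual width follow the public YT8M starter code and are assumptions on my part rather than details stated in this post.

```python
import numpy as np

def dequantize(q, min_val=-2.0, max_val=2.0):
    """Map uint8-quantized features back to floats.

    The [-2, 2] range and half-step bias mirror the public YT8M starter
    code; treat them as assumptions, not something this post specifies.
    """
    q = q.astype(np.float32)
    scale = (max_val - min_val) / 255.0
    bias = (max_val - min_val) / 512.0 + min_val
    return q * scale + bias

def segment_features(rgb_bytes, audio_bytes):
    """Assemble one 5-second segment into a (5, 1152) float array.

    rgb_bytes:   (5, 1024) uint8 visual features (post-PCA width assumed)
    audio_bytes: (5, 128)  uint8 audio features
    """
    rgb = dequantize(rgb_bytes)      # (5, 1024)
    audio = dequantize(audio_bytes)  # (5, 128)
    return np.concatenate([rgb, audio], axis=1)
```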
There are a few important notes about what the task is. Specifically, the task is that given a class we predict the relevant segments (and not the other way around!). The specific metric used to evaluate the task is Mean Average Precision @ K, where K = 100,000. This metric is a bit tricky to understand (and even harder to implement), but I will skip over the details here except for mentioning that, compared to global average precision, this metric favors correctly classifying rare events. The validation set, which serves as the training data for the challenge, contains 237,667 clips from 47,059 unique videos. The task is to provide 100,000 examples of each class, in decreasing order of confidence, from a collection of > 2,000,000 clips.
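To make the metric a bit more concrete, here is a small sketch of average precision at K for a single class over a ranked list of predicted segments. The function name and the list-based interface are mine, and the official evaluation code differs in its details.

```python
def average_precision_at_k(ranked_segments, relevant_segments, k=100_000):
    """Average precision @ K for one class.

    ranked_segments:   segment ids sorted by decreasing confidence
    relevant_segments: iterable of ground-truth positive segment ids
    """
    relevant = set(relevant_segments)
    hits = 0
    precision_sum = 0.0
    for i, seg in enumerate(ranked_segments[:k], start=1):
        if seg in relevant:
            hits += 1
            precision_sum += hits / i  # precision at this cut-off
    # Normalize by the number of positives (capped at K), so classes with
    # few positive segments can still reach a perfect score.
    denom = min(len(relevant), k)
    return precision_sum / denom if denom else 0.0
```

Mean Average Precision @ K is then the unweighted mean of this value over all 1,000 classes, so a rare class counts as much as a common one, which is why the metric rewards getting rare events right.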
1. Simple Model

By and large, I took advantage of work other people have made available. I wanted to deconstruct the code for a very simple PyTorch model that placed 21st in the competition. The piece I will start from is a pooling block built out of `nn.BatchNorm1d(width)`, a Swish activation, and `nn.AdaptiveAvgPool1d(1)`, which averages over the time dimension.
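Only that fragment of the model appears so far, so here is a minimal, self-contained sketch of how such a block could sit inside a segment classifier. The `Swish` module, the `SegmentClassifier` wrapper, the 1x1 convolution, and the 1152-input / 1000-class widths are my own assumptions for illustration, not the author's actual architecture; only the BatchNorm1d -> Swish -> AdaptiveAvgPool1d block comes from the snippet above.

```python
import torch
import torch.nn as nn

class Swish(nn.Module):
    """x * sigmoid(x); stands in for the Swish activation in the snippet."""
    def forward(self, x):
        return x * torch.sigmoid(x)

class SegmentClassifier(nn.Module):
    """Hypothetical wrapper around the BatchNorm1d -> Swish -> AdaptiveAvgPool1d block.

    Input:  (batch, 1152, 5) -- 5 seconds of concatenated rgb+audio features
    Output: (batch, 1000)    -- one logit per class
    """
    def __init__(self, in_dim=1152, width=512, n_classes=1000):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(in_dim, width, kernel_size=1),  # assumed projection, not from the post
            nn.BatchNorm1d(width),
            Swish(),
            nn.AdaptiveAvgPool1d(1),                  # average in time dimension
        )
        self.head = nn.Linear(width, n_classes)

    def forward(self, x):
        pooled = self.body(x).squeeze(-1)  # (batch, width)
        return self.head(pooled)

# quick shape check
model = SegmentClassifier()
logits = model(torch.randn(2, 1152, 5))
print(logits.shape)  # torch.Size([2, 1000])
```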