lecture 8
walking tour of AI developments in computer vision
(meet the big names)
+
revision
Seats App! Seats App! Seats App!
welcome back in the new year 🧤
lecture plan
Part 1 🧠
- MNIST and LeNet
- ImageNet and AlexNet
- VGG, Inception, ResNet
- demo
- Style transfer
Part 2 🧑‍🎤
our deal! + walking tour of this unit's summary sheet and mock questions
when we think about AI:
on Apple's machine learning models page:
what is the connection between these?
pattern recognition -> pattern generation
and more... let's take a look at the milestones!
big picture 01:
every big-name model is usually characterised by one or a few brilliant architectural designs (aka a new layer type).
big picture 02:
we often talk about the brilliant designs of AI models and sometimes overlook the importance of the dataset
Back to the year 1998
"hello world" dataset of machine learning - MNIST, handwritten digits images with labels
Lenet: one of the earlies Convolutional Neural Network, with conv and pooling layers, the prototype of everything follows
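a minimal PyTorch sketch of a LeNet-style network for 28×28 MNIST inputs (layer sizes follow the classic LeNet-5 recipe; the original differs in activation and subsampling details):

```python
import torch
import torch.nn as nn

class LeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),  # 28x28 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                            # pooling: 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),            # 14x14 -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                            # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, 10),  # 10 digit classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = LeNet()(torch.zeros(1, 1, 28, 28))  # one fake MNIST image
print(logits.shape)  # torch.Size([1, 10])
```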
But what next? Where should computer vision go? In the early 2000s, computer vision tasks included matching satellite images, image stitching, 3D scene reconstruction ...
finding the focal point, the right level of abstraction: object recognition
2005 PASCAL Visual Object Classes Challenge
it has an annotated dataset
number of training images: ~1,500
image dimensions: RGB, roughly 450×280
four classes: motorbikes, bicycles, people, and cars
types of tasks: let's have a look!
types of tasks:
classification: outputs label
object bounding box detection: outputs a bounding box
object segmentation: outputs pixel mask
these are the milestone tasks for computer vision
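just to make the outputs concrete, a toy sketch (shapes and values are illustrative, not from any actual VOC toolkit):

```python
import numpy as np

# the three task outputs for one ~450x280 RGB image
image = np.zeros((280, 450, 3), dtype=np.uint8)

label = "bicycle"                             # classification: one label per image
bbox = (120, 40, 310, 220)                    # detection: (x_min, y_min, x_max, y_max) per object
mask = np.zeros(image.shape[:2], dtype=bool)  # segmentation: one yes/no per pixel
```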
while "gathering data" is an everyday phrase now, back in 2005 there was not much of a mindset about "data"
the visionary ImageNet was introduced in 2010
let's look at its visionary scale:
number of training images: 1,281,167
image dimensions: RGB, 469×387 on average
1000 classes: based on WordNet
tasks: object labels and bounding boxes
and introducing the godmother of recent computer vision developments: Fei-Fei Li
another visionary design of ImageNet: 1000 classes from WordNet
WordNet consists of many English-language terms organized into an ontological structure
WordNet example: a lexical database that is ontologically structured
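a tiny demo of that ontological (is-a) structure, assuming nltk with the wordnet corpus downloaded:

```python
# (assumes: pip install nltk, then a one-time nltk.download('wordnet'))
from nltk.corpus import wordnet as wn

synset = wn.synset('bicycle.n.01')
while True:
    print(synset.name().split('.')[0])  # bicycle, wheeled_vehicle, ..., entity
    if not synset.hypernyms():          # stop at the root of the ontology
        break
    synset = synset.hypernyms()[0]      # step up one "is-a" level
```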
it bridges computer vision with cognitive science, and a 1000-class classification task was far beyond the capability of contemporary models
ImageNet challenge top scorers from 2010 to 2017
Explanation of "top-5" error on whiteboard
- it was really bad in 2010 and 2011
- everything changed from 2012, from AlexNet onwards
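for reference, a minimal sketch of how top-5 error can be computed (with random toy scores in place of real model outputs):

```python
import torch

# "top-5 error" = fraction of images whose true label is NOT among the
# model's five highest-scoring guesses
scores = torch.randn(8, 1000)                  # 8 images, 1000 ImageNet classes
labels = torch.randint(0, 1000, (8,))          # toy ground-truth labels

top5 = scores.topk(5, dim=1).indices           # the 5 best guesses per image
hit = (top5 == labels.unsqueeze(1)).any(dim=1) # is the true label among them?
print(1.0 - hit.float().mean().item())         # the top-5 error rate
```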

- From AlexNet in 2012, we saw an explosion of AI models in CV.
- Let's take a look at some of the big names (they are all image classifiers initially trained on the ImageNet dataset).
just for reference
AlexNet: the first CNN that goes "deep", and uses GPUs for training
VGG: deeper, with smaller filter sizes in its conv layers
ResNet: residual modules ("new layer type"), connections jumping over layers (see the sketch after this list)
Inception, or GoogLeNet: inception modules ("new layer type"), it goes "wide"
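to make "connections jumping over layers" concrete, a minimal residual block sketch (simplified: real ResNet blocks also include batch norm and handle stride/channel changes on the skip path):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # the skip connection: add the input back

x = torch.zeros(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # same shape in, same shape out
```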
all these trained-on-ImageNet models can be used as a good starting point for "any" vision task
think of these models as "vision task bootcamp" graduates.
they are good visual feature extractors
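for example, a small sketch of reusing a pretrained model as a feature extractor (assumes a recent torchvision; older versions pass pretrained=True instead of weights):

```python
import torch
from torchvision import models

# take an ImageNet-pretrained VGG16 and keep only the conv stack,
# dropping the 1000-class classifier head
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
extractor = vgg.features.eval()

with torch.no_grad():
    feats = extractor(torch.zeros(1, 3, 224, 224))  # any 224x224 RGB image
print(feats.shape)  # torch.Size([1, 512, 7, 7]) - generic visual features
```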
AI models nowadays
let's look at some of the Apple models and their corresponding tasks; now this page should be much more familiar
play around with pose detection
style transfer:
input: one content image, one style image
output: the content image "dipped" in the style (some online examples)
wait, how does this relate to all the CNN/VGG jargon we just got loaded with?
the algorithmic design of neural style transfer:
it is an optimisation formalisation comprising two constraints -
-- the generated image should be similar in content to the content image
-- the generated image should be similar in style to the style image
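in equation form (the standard formulation, with α and β weighting the two constraints):

```latex
\mathcal{L}_{\mathrm{total}} =
  \alpha \, \mathcal{L}_{\mathrm{content}}(\text{content image},\, \text{generated image})
+ \beta  \, \mathcal{L}_{\mathrm{style}}(\text{style image},\, \text{generated image})
```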
how do we know if two images are similar in content or style?
in other words, how to numberify the content and style???
original quotes from this paper:
-- "Two images are similar in content if their high-level features as extracted by a trained classifier are close."
-- "Two images are similar in style if their low-level features as extracted by a trained classifier share the same statistic."
guess what? they use VGG as the feature extractor.
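putting the two quotes into a minimal sketch, with random tensors standing in for VGG feature maps (the Gram matrix below is the "same statistic" that the style loss compares):

```python
import torch

def gram_matrix(features):
    # features: (channels, height, width) activations from one VGG layer
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return (flat @ flat.T) / (c * h * w)  # channel-to-channel correlations

content_feats = torch.randn(512, 28, 28)  # high-level features -> content
style_feats = torch.randn(64, 112, 112)   # low-level features -> style
generated_hi = torch.randn(512, 28, 28)   # the generated image's features
generated_lo = torch.randn(64, 112, 112)  # at the same two levels

# content: compare the high-level features directly
content_loss = ((generated_hi - content_feats) ** 2).mean()
# style: compare the low-level feature statistics (Gram matrices)
style_loss = ((gram_matrix(generated_lo) - gram_matrix(style_feats)) ** 2).sum()
```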
This technical idea also invites discussion: can styles really be represented by a bunch of numbers output from VGG? Judging from the hyped style transfer results...
a nice project that includes style transfer
stay tuned..
in the next unit, ML Two, we'll look into hands-on practice (preparing datasets, training, implementation, etc.)
the deal...
presentation brief
revision time!
don't forget to send initial presentation ideas to me next Thursday (sketchy ones are encouraged!) and if you feel stuck, let me know!
references
- ImageNet challenge results
- Types of CNNs
- LeNet-5
- VGG16