**Taught by Patrick Hebron at ITP, Fall 2016**

In a supervised learning procedure, the machine tries to learn the underlying relationships between a set of example inputs and their corresponding outputs.

In an unsupervised learning procedure, the machine only tries to learn about the internal structure of the input examples.

This allows us to extract meaningful features and patterns from the input data.

Unsupervised learning is useful in various forms of data analysis and dimensionality reduction.

Unsupervised learning can also be applied to generative or creative applications. For example, we could learn about the kinds of patterns that exist within a particular composer's work and then use this to generate new compositions in the style of that artist.

It can also be useful in improving the quality of a supervised learning procedure. This is particularly true in cases where we have far more input examples than we have correctly labeled corresponding outputs. For example, the internet provides an endless supply of images from which we could learn internal patterns. But only a small portion of these images come with reliable descriptions of the images' contents. Unsupervised learning allows us to learn common patterns across both labeled and unlabeled images and then use these underlying patterns to associate previously unlabeled images with labels that correspond to structurally similar images.

This approach bears some resemblance to our own learning process in the sense that we have many experiences interacting with a particular kind of object, but a much smaller number of experiences in which another person explicitly tells us the name of that object.

Unsupervised learning in general and in its relation to supervised learning can tell us a lot about the nature of thought and the representation of ideas.

**This section is based on my article, Machine Learning and Improvisation.**

A Restricted Boltzmann Machine (RBM) is a two-layer neural network composed of a *visible layer* and a *hidden layer*.

Each unit in the visible layer is connected to each unit in the hidden layer.

The visible units are not connected to one another.

The hidden units are not connected to one another.

An RBM observes real-world patterns and tries to create a lower-dimensional representation of those patterns.

We can also stack multiple RBMs on top of one another to reduce the dimensionality of the patterns even further.

One common architecture for stacked RBMs is called a Deep Belief Network (DBN).

The goal of the training process is to produce lower dimensional representations that can then be used to “reconstruct” approximations of the original inputs.

In the animation below, we will train a DBN on leaf images from 100 species, using 16 images per species.

You can think of this as a kind of compression algorithm, improvised in relation to the neural network’s experience.

The network compresses information by finding component patterns across many input examples. It can then use these component patterns as building blocks to describe the whole.

This is an efficient way to store information because it means we don’t need to hold onto every detail of every input example. We can use a more general vocabulary derived from all of the examples to describe each particular example.

The reconstruction of an input example through this process is not exact.

But that’s what’s interesting about it!

Notice how the approximation changes as we go to deeper layers of the neural network:

And notice what happens when we reconstruct partially obfuscated images:

This is somewhat like our minds filling in the missing pieces of a face that has been partially occluded by some other object such as a telephone pole.

Representing experiences through a shared set of component patterns means that we don’t have to treat each as entirely separate from or incomparable to each other. We can fill gaps in our experiences by borrowing from our memory of other similar experiences.

The \(x\) component is multiplied by a \(weight\) and added to a \(bias\).

The result of this operation is then passed through an activation function.

The resulting \(activation\) represents the strength of the signal passing through the hidden node, given input \(x\).

More generally:

\( activation = activationFunction( weight * input + hiddenBias ) \)

Each \(x\) component is multiplied by a separate \(weight\). These products are summed together and added to a \(bias\).

The result of this operation is passed through an activation function.

Each hidden node receives the components of the input signal, multiplied by their corresponding \(weights\). These products are summed together and added to the hidden node's \(bias\). The results of these operations are passed through an activation function.

During reconstruction, the hidden activations are passed backward through the network.

The \(activations\) are multiplied by their corresponding \(weights\). These products are summed together and added to the visible node's \(bias\). The results of these operations are passed through an activation function.

More generally:

\( reconstruction = activationFunction( {activation}^{T} * weight + visibleBias ) \)

The output of this reconstruction process acts as an approximation of the original input.

The forward and backward passes described above give us two symmetrical processes for abstracting and deabstracting (or, compressing and decompressing) information:

**Get Hidden Activations (Propagate Up):**

\( activationsH = \frac{1}{1 + {e}^{-(inputV \cdot weights + biasH)}} \\ \)

**Get Visible Activations (Propagate Down):**

\( activationsV = \frac{1}{1 + {e}^{-(inputH \cdot {weights}^{T} + biasV)}} \\ \)

**Based on CSC321: Neural Networks, by Geoffrey Hinton.**

**Learning Rule:**

\( \Delta{{W}_{ij}} = learningRate * ( {\langle{v}_{i}{h}_{j}\rangle}^{0} - {\langle{v}_{i}{h}_{j}\rangle}^{\infty} ) \)

**Learning Procedure:**

- Apply a training example to the visible units
- Alternate between:
- Updating all hidden units in parallel
- Updating all visible units in parallel

- After many alternations (timestep \(\infty\)), the system will reach
*thermal equilibrium*. - We can now measure the correlation between \(v_i\) and \(h_j\).

**Limitations:**

- We have to perform this alternation many times before it reaches thermal equilibrium.
- If we don't perform it a sufficient number of times, the learning may go wrong.
- But, it turns out that we can take a shortcut: Contrastive Divergence.

**Learning Rule:**

\( \Delta{{W}_{ij}} = learningRate * ( {\langle{v}_{i}{h}_{j}\rangle}^{0} - {\langle{v}_{i}{h}_{j}\rangle}^{1} ) \)

**Learning Procedure:**

- Apply a training example to the visible units
- Update all hidden units in parallel
- Update all visible units in parallel
- Update all hidden units again

Instead of waiting for thermal equilibrium, we use the measurements produced by one update step.

**Why Contrastive Divergence Works:**

- If we start from the data, the Markov chain will wander away from the data values themselves and towards values it likes more (an equilibrium distribution).
- After only a few steps, we can see what direction it is wandering in.
- Since we know the initial weights are bad, it is wasteful to let them go all the way to equilibrium. We know how to change the weights to stop them from wandering away from the data without going all the way to equilibrium.

- We simply need to lower the probability of the reconstructions it produces after one full step and then raise the probability of the data.
- This will stop the chain from wandering away from the data.
- The learning will cancel out once the reconstructions and the data have the same distribution as one another.

```
import numpy as np
# Sigmoid function
def sigmoid(x):
return 1.0 / ( 1.0 + np.exp( -x ) )
# Bernoulli Restricted Boltzmann Machine class
class RBM:
def __init__(self, sizeV, sizeH):
# Initialize random number generator:
self.rng = np.random.RandomState()
# Initialize weights:
self.weights = np.array( self.rng.uniform( -1.0 / sizeV, 1.0 / sizeV, ( sizeV, sizeH ) ) )
# Initialize biases:
self.biasH = np.zeros( sizeH )
self.biasV = np.zeros( sizeV )
def train(self, data, training_epochs, learning_rate, cd_steps):
# Perform each training epoch:
for epoch in xrange( training_epochs ):
# Get hidden activations and samples:
aH_0, sH_0 = self.getHiddenSample( data )
# Perform each contrastive divergence step:
for i in xrange( cd_steps ):
aV_inf, sV_inf, aH_inf, sH_inf = self.getGibbsHvh( ( sH_0 if i == 0 else sH_inf ) )
# Update weights:
self.weights += learning_rate * ( np.dot( data.T, aH_0 ) - np.dot( sV_inf.T, aH_inf ) )
# Update biases:
self.biasV += learning_rate * np.mean( data - sV_inf, axis = 0 )
self.biasH += learning_rate * np.mean( aH_0 - aH_inf, axis = 0 )
def getHiddenActivations(self, inputV):
return sigmoid( np.dot( inputV, self.weights ) + self.biasH )
def getVisibleActivations(self, inputH):
return sigmoid( np.dot( inputH, self.weights.T ) + self.biasV )
def getHiddenSample(self, inputV):
aH = self.getHiddenActivations( inputV )
sH = self.rng.binomial( 1, aH, aH.shape )
return [ aH, sH ]
def getVisibleSample(self, inputH):
aV = self.getVisibleActivations( inputH )
sV = self.rng.binomial( 1, aV, aV.shape )
return [ aV, sV ]
def getGibbsHvh(self, inputH):
aV, sV = self.getVisibleSample( inputH )
aH, sH = self.getHiddenSample( sV )
return [ aV, sV, aH, sH ]
def getGibbsVhv(self, inputV):
aH, sH = self.getHiddenSample( inputV )
aV, sV = self.getVisibleSample( sH )
return [ aH, sH, aV, sV ]
def getReconstruction(self, inputV):
aH = self.getHiddenActivations( inputV )
aV = self.getVisibleActivations( aH )
return aV
```

Take a look at the full Restricted Boltzmann Machine example code.

- Project Brainstorming: Identify an applied learning problem that is of interest to you. Consider the problem, what sort of learning algorithm you would use to address it, what auxiliary tools would be needed and so forth. Try to identify challenges, limitations, etc. Also, do some research into if/how others have tried to solve a similar problem. What tools did they use? Put together a brief presentation that describes your idea and outlines what further research and implementation work would be required to realize this idea.
- In the second half of the semester, we will explore open source machine learning toolkits including TensorFlow, Theano and Keras. In order to simplify the process of installing and working with these tools, we will use Docker and a new tool for interfacing with Docker called LaunchBot. LaunchBot is developed by Andrew Odewahn, who has generously offered to share this tool with our class. We'll talk more about these technologies in class. To get ready for that, please do the following before our class next week:
- Please sign-up for a LaunchBot account.
- Please follow the steps listed in the LaunchBot Getting Started document.
- Please email me if you have any trouble with the above steps.

**Primary Source:***Distributed Representations of Words and Phrases and their Compositionality*by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean.