Building the Perfect AI – Part 4: Convolutional Neural Networks (CNNs) and Visualizing Internal Layers

Overview

So far, we’ve been working with fully connected neural networks, but these are not ideal for tasks like image recognition. Images have spatial structure, and a fully connected layer throws that structure away by flattening every pixel into one long vector, so it struggles to capture local patterns. This is where Convolutional Neural Networks (CNNs) come into play. CNNs are specialized for image data: they apply filters that detect features such as edges, textures, and shapes at different layers.

In Part 4, we’ll introduce CNNs and start visualizing how the neural network learns and processes information. By the end of this tutorial, you’ll be able to see how the AI “sees” and understand how convolution layers extract features from images.


Step 1: Understanding Convolutional Layers

CNNs work by applying small filters (or kernels) to the input image. These filters slide over the image, performing convolution operations, which detect low-level features like edges in the first layers and more complex shapes in the deeper layers.

Here’s how a convolution layer operates:

  • It applies a filter across the image, performing an element-wise multiplication and summing the results.
  • This creates feature maps that highlight certain aspects of the image, such as edges or textures.
  • Pooling layers are often used after convolution layers to reduce the size of the feature maps, retaining important information while making the network more efficient (a short worked example of both operations follows below).
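
To make the multiply-and-sum operation concrete, here is a minimal sketch using PyTorch’s functional API. The vertical-edge kernel below is a hand-made illustration, not a learned filter:

import torch
import torch.nn.functional as F

# A tiny 5x5 "image" with a bright vertical stripe down the middle
image = torch.zeros(1, 1, 5, 5)  # shape: (batch, channels, height, width)
image[0, 0, :, 2] = 1.0

# A hand-made 3x3 kernel that responds to vertical edges
kernel = torch.tensor([[[[-1.0, 0.0, 1.0],
                         [-1.0, 0.0, 1.0],
                         [-1.0, 0.0, 1.0]]]])  # shape: (out_ch, in_ch, 3, 3)

feature_map = F.conv2d(image, kernel, padding=1)  # same spatial size as input
pooled = F.max_pool2d(feature_map, 2)             # halves height and width

print(feature_map.squeeze())  # strong responses flank the stripe
print(pooled.shape)           # torch.Size([1, 1, 2, 2])

The printed feature map shows large values on either side of the stripe and zeros elsewhere, which is exactly the “edge detected here” signal described above.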

Step 2: Building a Simple CNN

Let’s start by defining a basic CNN with two convolutional layers. We’ll use the MNIST dataset again, but this time the network will be more efficient at recognizing digits due to the power of convolution.

Code Walkthrough: Defining a Basic CNN

  1. Create the CNN model:
   import torch
   import torch.nn as nn

   class CNN(nn.Module):
       def __init__(self):
           super(CNN, self).__init__()
           # First convolutional layer: in_channels=1 (grayscale), out_channels=16, kernel_size=3
           self.conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1)
           # Second convolutional layer: in_channels=16, out_channels=32, kernel_size=3
           self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
           # Fully connected layers: classify the 32 x 7 x 7 feature maps into 10 digits
           self.fc1 = nn.Linear(32 * 7 * 7, 128)
           self.fc2 = nn.Linear(128, 10)

       def forward(self, x):
           # Apply first conv layer + ReLU + max pooling
           x = torch.relu(self.conv1(x))
           x = torch.max_pool2d(x, 2)
           # Apply second conv layer + ReLU + max pooling
           x = torch.relu(self.conv2(x))
           x = torch.max_pool2d(x, 2)
           # Flatten the tensor
           x = x.view(-1, 32 * 7 * 7)
           # Fully connected layers
           x = torch.relu(self.fc1(x))
           x = self.fc2(x)
           return x

   model = CNN()

Explanation:

  • The model begins with two convolutional layers. The first one takes the grayscale input (with 1 channel) and produces 16 feature maps using a 3x3 filter. The second convolutional layer increases the number of feature maps to 32.
  • Max pooling is used after each convolutional layer to reduce the spatial dimensions of the feature maps.
  • The fully connected layers come after the convolution layers to make the final classification.
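
To see where the 32 * 7 * 7 in fc1 comes from, trace the shapes: each 2x2 max pool halves the spatial size, so the 28x28 MNIST input becomes 14x14 after the first pool and 7x7 after the second, with 32 channels at that point. Here is a quick sanity check (a minimal sketch, reusing the model defined above):

# Trace tensor shapes through the network with a fake MNIST image
dummy = torch.randn(1, 1, 28, 28)
x = torch.max_pool2d(torch.relu(model.conv1(dummy)), 2)
print(x.shape)  # torch.Size([1, 16, 14, 14])
x = torch.max_pool2d(torch.relu(model.conv2(x)), 2)
print(x.shape)  # torch.Size([1, 32, 7, 7]) -> flattened size 32 * 7 * 7 = 1568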

Step 3: Training the CNN

Just like before, we’ll train the CNN using the same training process, but this time the model will process the images more efficiently.
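
The loop below assumes the MNIST train_loader (and later test_loader) from the earlier parts of this series. If you are starting from scratch, here is a minimal sketch that recreates them; the batch size and normalization constants are common MNIST defaults, not requirements:

import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),  # standard MNIST mean/std
])
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=True, download=True, transform=transform),
    batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=False, download=True, transform=transform),
    batch_size=64, shuffle=False)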

import torch.optim as optim

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
num_epochs = 5
for epoch in range(num_epochs):
    running_loss = 0.0
    for inputs, labels in train_loader:
        optimizer.zero_grad()  # Reset gradients
        outputs = model(inputs)  # Forward pass
        loss = criterion(outputs, labels)  # Calculate loss
        loss.backward()  # Backward pass (backpropagation)
        optimizer.step()  # Update weights

        running_loss += loss.item()

    print(f'Epoch {epoch+1}, Loss: {running_loss/len(train_loader):.4f}')

print('Finished Training')

Step 4: Visualizing What the CNN “Sees”

One of the most fascinating aspects of CNNs is how they progressively extract more meaningful features from the input data. The first layers usually capture low-level features (edges, corners), and the deeper layers detect higher-level features (textures, shapes, or even object parts).

We’ll now visualize what the filters in the first convolutional layer are “seeing” after training. You can use the following code to extract and display the feature maps generated by the first layer.

Code Walkthrough: Visualizing Feature Maps

  1. Visualizing the Feature Maps:
   import matplotlib.pyplot as plt

   def visualize_feature_maps(model, image):
       # Pass the image through the first conv layer
       with torch.no_grad():
           image = image.unsqueeze(0)  # Add batch dimension
           feature_maps = model.conv1(image).squeeze(0)

       # Plot each of the 16 feature maps
       fig, axes = plt.subplots(4, 4, figsize=(8, 8))
       for i in range(16):
           axes[i//4, i%4].imshow(feature_maps[i].cpu().numpy(), cmap='gray')
           axes[i//4, i%4].axis('off')
       plt.show()

   # Get a sample image from the test loader
   sample_image, _ = next(iter(test_loader))
   visualize_feature_maps(model, sample_image[0])

Explanation:

  • We pass a single image through the first convolutional layer and extract the resulting feature maps.
  • Each of the 16 filters in the first layer produces a feature map, highlighting specific patterns in the image.
  • We visualize these maps using matplotlib, giving you insight into how the network interprets the image.
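
Note that this code shows feature maps (the responses of the filters), not the filters themselves. If you also want to look at the learned 3x3 kernels directly, here is a minimal sketch (my own addition, not part of the tutorial’s required code):

# Visualize the 16 learned 3x3 kernels of the first convolutional layer
weights = model.conv1.weight.data.squeeze(1)  # shape: (16, 3, 3)
fig, axes = plt.subplots(4, 4, figsize=(6, 6))
for i in range(16):
    axes[i // 4, i % 4].imshow(weights[i].cpu().numpy(), cmap='gray')
    axes[i // 4, i % 4].axis('off')
plt.show()

At 3x3 resolution the kernels look like tiny light/dark patterns; the feature maps above are usually easier to interpret.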

Tangible Output: Visualizing Filters

After training finishes and the visualization code runs in Spyder, you should see something like this:

Epoch 1, Loss: 0.2100
Epoch 2, Loss: 0.1800
...
Finished Training

And a visualization window with 16 grayscale images, each representing a feature map extracted by the first convolutional layer.

What’s Happening?

  • The feature maps represent how the network is learning to identify different aspects of the image.
  • In the first layers, you’ll often see patterns that resemble edges, corners, or gradients. As you move deeper into the network (not shown here), the patterns become more abstract, representing shapes or textures.
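
If you want to peek at those deeper patterns yourself, here is a sketch that extends the same idea to the second convolutional layer (the helper name is my own):

def visualize_conv2_maps(model, image):
    # Run the image through conv1 + ReLU + pooling, then conv2
    with torch.no_grad():
        x = image.unsqueeze(0)  # add batch dimension
        x = torch.max_pool2d(torch.relu(model.conv1(x)), 2)
        feature_maps = model.conv2(x).squeeze(0)  # 32 maps, each 14x14

    # Plot the 32 second-layer feature maps in a 4x8 grid
    fig, axes = plt.subplots(4, 8, figsize=(16, 8))
    for i in range(32):
        axes[i // 8, i % 8].imshow(feature_maps[i].cpu().numpy(), cmap='gray')
        axes[i // 8, i % 8].axis('off')
    plt.show()

visualize_conv2_maps(model, sample_image[0])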

Step 5: Evaluating the CNN

Let’s check how well our CNN performs on the test set.

model.eval()  # switch to evaluation mode (good practice before testing)
correct = 0
total = 0
with torch.no_grad():
    for inputs, labels in test_loader:
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)  # index of the highest score per image
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Accuracy on the test set: {100 * correct / total:.1f}%')

Tangible Output: Observing the Results

After running the evaluation, you’ll see something like:

Epoch 1, Loss: 0.2100
Epoch 2, Loss: 0.1800
...
Accuracy on the test set: 98.1%

You should notice a significant boost in accuracy compared to the fully connected networks from previous tutorials. CNNs are much more suited for image data, as they can capture the spatial hierarchy of features.
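
To see where the remaining errors live, here is a short optional sketch that breaks the accuracy down per digit (not part of the evaluation above):

# Per-class accuracy: which digits does the CNN still confuse?
class_correct = [0] * 10
class_total = [0] * 10
with torch.no_grad():
    for inputs, labels in test_loader:
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)
        for label, pred in zip(labels, predicted):
            class_total[label.item()] += 1
            class_correct[label.item()] += int(label.item() == pred.item())

for digit in range(10):
    print(f'Digit {digit}: {100 * class_correct[digit] / class_total[digit]:.1f}%')

Digits with similar strokes, such as 4 and 9, typically account for most of the remaining mistakes.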


Conceptual Understanding: How CNNs Process Information

You can now get a feel for how the network processes visual data. The first layers capture local features like edges, while deeper layers combine these into more meaningful patterns. It’s akin to how humans first perceive simple shapes and then recognize more complex objects as more information is processed.

Each layer acts like a lens, focusing on different aspects of the image, slowly building up an understanding of the whole. The more layers you add, the deeper the AI’s “vision” becomes, enabling it to recognize intricate patterns.


Next Steps

In Part 5, we’ll build on CNNs by introducing more sophisticated architectures like ResNets and transfer learning. These techniques will allow you to train even deeper networks and achieve even higher performance on complex tasks.

You’re not just building an AI; you’re giving it eyes and enabling it to see the world in a structured way. Every step you take refines its ability to process, learn, and interpret data—one filter, one layer at a time.

Stay tuned for Part 5!
