Sathvick Views: C35 How Computers Learn to See

Computer Vision: From Pixels to Perceptions Briefing

Dr Sudheendra S G provides an overview of key concepts in computer vision, outlining how images are processed, features are extracted, and tasks like classification, detection, and tracking are performed, while also addressing critical ethical considerations.

I. Core Concepts: Pixels, Patches, and Convolution

A. Image Representation: Pixels

Images are fundamentally represented as grids of pixels. Each pixel stores intensity information, either as a single value for grayscale images or as an RGB triplet for color images.

Quote: "Images are grids of pixels. Color often stored as RGB; grayscale is one intensity per pixel."

A simple approach to tracking an object is to pick a target color and, in each frame, find the pixel whose RGB value is closest to it. This method is fragile in real-world scenes, because lighting, shadows, and similarly colored objects all change or duplicate the target color. The source lists the "failure cases: lighting changes, shadows, jerseys same color → confusions."
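
The nearest-color idea can be sketched in a few lines. This is a minimal illustration, assuming NumPy and a frame stored as a height x width x 3 array of RGB values; the function name `closest_color_pixel` and the tiny test frame are hypothetical and not taken from the source.

```python
import numpy as np

def closest_color_pixel(frame, target_rgb):
    """Return (row, col) of the pixel whose RGB value is nearest to target_rgb."""
    # Squared Euclidean distance in RGB space, computed for every pixel.
    diff = frame.astype(np.int64) - np.asarray(target_rgb, dtype=np.int64)
    dist = (diff ** 2).sum(axis=-1)
    # argmin indexes the flattened array; unravel it back to 2-D coordinates.
    row, col = np.unravel_index(np.argmin(dist), dist.shape)
    return int(row), int(col)

# Hypothetical 4x4 frame: black background with one orange "ball" pixel.
frame = np.zeros((4, 4, 3), dtype=np.uint8)
frame[2, 1] = (250, 120, 10)
print(closest_color_pixel(frame, (255, 128, 0)))  # (2, 1)
```

Because the match depends only on raw RGB distance, a shadow that darkens the target or a second object of a similar color can win the argmin instead, which is the fragility described above.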

B. Feature Extraction: Patches, Kernels, and Convolution

To extract more robust features, computer vision analyzes "patches" of multiple pixels using small matrices called kernels or filters.

Quote: "Many features (e.g., edges) span multiple pixels. We analyze patches using a small matrix called a kernel/filter."

Convolution applies a kernel to an image patch with a "multiply-and-sum" operation (multiply each pixel by the corresponding kernel weight, then add the products) and repeats this as the kernel slides across the entire image. The result is a feature map; with an edge kernel it is an "edge map," where "big magnitude ⇒ likely edge."
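
The multiply-and-sum step can be written out directly. The sketch below assumes NumPy, a grayscale image as a 2-D array, and no padding (so the output is slightly smaller than the input); `convolve2d` is an illustrative name. Like most vision and CNN code, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide kernel over image (no padding) and multiply-and-sum at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = image[r:r + kh, c:c + kw]   # the patch under the kernel
            out[r, c] = np.sum(patch * kernel)  # multiply-and-sum
    return out

# Hypothetical image: dark on the left, bright on the right (a vertical edge).
image = np.array([[0, 0, 10, 10]] * 3, dtype=float)
kernel = np.array([[-1.0, 1.0]])                # simple horizontal difference
edge_map = convolve2d(image, kernel)
print(edge_map)  # each row is [0, 10, 0]: the large value marks the edge
```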

Different kernels can be designed to detect various features:

Edge detection: Kernels like Prewitt or Sobel highlight vertical or horizontal edges.
Blurring: A "box blur" kernel averages pixel values, smoothing the image.
Sharpening: An "unsharp mask style" kernel enhances details.
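
To make the three kernel families concrete, here is a small sketch assuming NumPy. The Sobel, Prewitt, and box-blur matrices are the standard 3x3 forms; the sharpening matrix is one common choice and may not be the exact kernel the source has in mind. The helper `respond` and the sample patches are hypothetical.

```python
import numpy as np

# Standard 3x3 kernels that respond to vertical edges (horizontal gradient).
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
PREWITT_X = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)
# Box blur: every pixel in the patch contributes equally to the average.
BOX_BLUR = np.ones((3, 3)) / 9.0
# One common sharpening kernel (assumed here): boost the center, subtract neighbors.
SHARPEN = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float)

def respond(patch, kernel):
    """Multiply-and-sum of one 3x3 patch with one 3x3 kernel."""
    return float(np.sum(patch * kernel))

edge_patch = np.array([[0, 0, 10]] * 3, dtype=float)  # dark left, bright right
flat_patch = np.full((3, 3), 10.0)                    # uniform region

print(respond(edge_patch, SOBEL_X))    # 40.0: strong response on the edge
print(respond(edge_patch, PREWITT_X))  # 30.0
print(respond(flat_patch, SOBEL_X))    # 0.0: no edge in a flat region
print(respond(flat_patch, SHARPEN))    # 10.0: flat regions pass through unchanged
```

Sliding any of these kernels across a whole image with the multiply-and-sum operation produces the corresponding edge, blurred, or sharpened output.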

II. Evolution of Feature Detection: Handcrafted vs. Learned

A. Handcrafted Features: Viola–Jones Algorithm

Early computer vision methods, like the classic Viola–Jones algorithm, rely on hand-designed features to identify objects. These methods stack "simple cues (lines, dark-on-light patterns)" to find objects without relying on color information.

Quote: "Viola–Jones (classic method) uses fast rectangular features (Haar-like) and scans a window across the image."

Haar-like features are small, rectangular patterns (e.g., light-dark pairs for a nose bridge, three-stripes for an eye region, or a surrounded dark blob for a pupil) that are quickly computed across an image using a "sliding window" approach. The combination of many "weak features" leads to a "strong detector."
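
A light-dark rectangle pair can be evaluated very cheaply with an integral image (summed-area table), which is how Viola–Jones is usually described as getting its speed; the source does not spell this step out, so treat the sketch below as an illustration under that assumption. It uses NumPy, and all function names are hypothetical.

```python
import numpy as np

def integral_image(gray):
    """Summed-area table with an extra zero row and column for easy indexing."""
    ii = np.zeros((gray.shape[0] + 1, gray.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = gray.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle, using only four table lookups."""
    bottom, right = top + height, left + width
    return int(ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left])

def two_rect_feature(ii, top, left, height, width):
    """Left half minus right half: large when a bright region sits left of a dark one."""
    half = width // 2
    left_sum = rect_sum(ii, top, left, height, half)
    right_sum = rect_sum(ii, top, left + half, height, half)
    return left_sum - right_sum

# Hypothetical 4x4 window: bright left half, dark right half.
gray = np.array([[9, 9, 1, 1]] * 4)
ii = integral_image(gray)
print(two_rect_feature(ii, 0, 0, 4, 4))  # 64 (72 on the left minus 8 on the right)
```

A sliding-window detector would evaluate many such features at every window position and combine their weak votes into the strong detector described above.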

B. Learned Features: Convolutional Neural Networks (CNNs)

Modern computer vision predominantly uses Convolutional Neural Networks (CNNs), which automatically "learn the filters instead of hand-designing them."

Quote: "CNN layers perform convolutions with learned kernels."

CNNs operate in layers, creating a feature hierarchy:

Early layers learn basic features like "edges."
Later layers learn more complex patterns like "corners/parts."
Deeper layers learn "object templates" (e.g., faces).

The CNN pipeline typically repeats "Conv + ReLU" and "Conv + Pooling" stages: each convolution produces feature maps, ReLU zeroes out negative responses, and pooling "downsamples" those maps. This helps to "reduce detail while raising abstraction," and the final feature maps are converted into "class scores." Training uses labeled data and backpropagation to adjust the kernel weights.
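
One convolution, ReLU, and pooling stage can be sketched as a forward pass in plain NumPy. This is only an illustration: the kernel here is fixed by hand as a stand-in for a learned one, there is a single channel, and no training or backpropagation is shown. All names are hypothetical.

```python
import numpy as np

def conv2d_valid(x, k):
    """Multiply-and-sum of kernel k at every position of x (no padding)."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * k)
    return out

def relu(x):
    """Keep positive responses, zero out negative ones."""
    return np.maximum(x, 0.0)

def max_pool2x2(x):
    """Downsample by taking the maximum of each non-overlapping 2x2 block."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    blocks = x[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

# Hypothetical 6x6 input with a vertical edge between columns 2 and 3.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
# Stand-in for a learned kernel (a real CNN would adjust these weights in training).
kernel = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

feature_map = relu(conv2d_valid(image, kernel))  # shape (4, 4)
pooled = max_pool2x2(feature_map)                # shape (2, 2)
print(feature_map.shape, pooled.shape)
```

The 6x6 input shrinks to a 2x2 map that still records "there is an edge here," which is one small instance of reducing detail while raising abstraction.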

III. Computer Vision Tasks and Metrics

A. Classification, Detection, and Tracking

Computer vision encompasses various tasks:

Classification: Assigning "one label for the whole image" (e.g., "this image contains a cat").
Detection: Identifying objects within an image and providing bounding boxes around them (e.g., "there is a cat at these coordinates").
Tracking: Following objects "across frames" in a video sequence. Challenges include "lighting changes, occlusion, motion blur," and re-identification when objects disappear and reappear.

B. Key Metrics

Intersection-over-Union (IoU): A common metric for evaluating the quality of object detection. It measures the overlap between a predicted bounding box and the ground-truth bounding box, calculated as "overlap area / union area." A higher IoU indicates a more accurate detection.
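Precision and Recall: Precision is the fraction of reported detections that are correct; recall is the fraction of true objects that are actually detected. Both are important for detection and for imbalanced datasets, where they capture the accuracy and completeness of detections.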
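
The "overlap area / union area" formula translates directly into code. The sketch below assumes axis-aligned boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2; that box format and the function name `iou` are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if there is one.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so that non-overlapping boxes give zero overlap area.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

predicted = (0, 0, 2, 2)
ground_truth = (1, 1, 3, 3)
print(iou(predicted, ground_truth))  # overlap 1, union 7, so 1/7
print(iou(predicted, predicted))     # 1.0 for a perfect match
```

Detection benchmarks often count a prediction as correct only when its IoU with a ground-truth box exceeds a chosen threshold; the threshold itself is a design choice and is not specified in the source.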

C. Facial Landmarks

Beyond detection, models can predict landmarks (e.g., "eyes, nose tip, mouth corners") on objects like faces. These landmarks enable detailed analysis, such as "expression checks (smile?), state (eyes open?), and alignment for recognition."
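
As one example of "alignment for recognition," the tilt of a face can be estimated from the two eye landmarks. This is a minimal sketch that assumes landmarks are given as (x, y) pixel coordinates with y increasing downward; the function name and the sample coordinates are hypothetical, and real pipelines typically use more landmarks than this.

```python
import math

def eye_line_angle_degrees(left_eye, right_eye):
    """Angle of the line through the eye landmarks relative to horizontal."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))

# Hypothetical landmark positions in pixel coordinates.
level = eye_line_angle_degrees((30, 50), (70, 50))   # eyes at the same height
tilted = eye_line_angle_degrees((30, 40), (70, 80))  # right eye lower in the image
print(level, tilted)
```

Rotating the image by the negative of this angle would bring the eyes level, which gives a recognition model a more consistent input.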

IV. Ethical Considerations and Limitations

Computer vision systems, while powerful, present significant ethical challenges and inherent limitations:

A. Bias and Fairness

Data Bias: "Models learn data patterns—including bias." If training data is unrepresentative or biased, the model will inherit and amplify those biases, leading to unfair or inaccurate outcomes across different demographic groups.
Mitigation: This requires "bias audits" and evaluating models "across groups."
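
Evaluating "across groups" can be as simple as computing the same metric separately for each group instead of once overall. The sketch below assumes each evaluation record carries a group tag, a true label, and a predicted label; the group names and the data are invented purely for illustration.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, true_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}

# Invented evaluation records: the model is right 6 times out of 8 overall.
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 0),
    ("B", 1, 0), ("B", 0, 0), ("B", 1, 0), ("B", 0, 0),
]
per_group = accuracy_by_group(records)
print(per_group)  # {'A': 1.0, 'B': 0.5}
```

The overall accuracy of 0.75 hides the fact that every error falls on group B, which is the kind of gap a bias audit is meant to surface.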

B. Privacy and Consent

Surveillance: "Vision systems raise privacy and consent issues (surveillance, face recognition)." The widespread deployment of cameras and facial recognition technology raises concerns about individual freedoms and the potential for misuse.
Mitigation: Emphasize "consent, on-device processing, opt-out, human oversight"; document datasets, limit data retention, and store data securely. Define a clear purpose for any use of the data.

C. Real-World Fragility

Environmental Factors: Vision systems can be fragile in diverse real-world conditions, sensitive to "lighting, angle, occlusions" (when an object is partially or fully hidden).
Domain Shift: Performance can degrade significantly during a "domain shift" (e.g., a model trained in a laboratory setting performing poorly on a crowded street).
Misconceptions: Three common ones are worth correcting. "Vision = classification" is false: vision also covers detection, segmentation, landmarks, and tracking. "More filters = always better" is false: data quality and evaluation matter more. "Accuracy alone is fine" is false, especially for detection and imbalanced data, where precision/recall and IoU are critical.
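
A small worked example shows why accuracy alone can mislead on imbalanced data. The labels below are invented: 5 positives among 100 samples, and a model that always predicts "negative." The function name is hypothetical, and precision is reported as 0.0 by convention when the model makes no positive predictions.

```python
def precision_recall_accuracy(y_true, y_pred):
    """Binary metrics for labels encoded as 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against division by zero when nothing is predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy

y_true = [1] * 5 + [0] * 95   # only 5% of samples are positive
y_pred = [0] * 100            # a model that never detects anything
precision, recall, accuracy = precision_recall_accuracy(y_true, y_pred)
print(precision, recall, accuracy)  # 0.0 0.0 0.95
```

The model scores 95% accuracy while finding none of the positives, so recall exposes a failure that accuracy hides.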

V. Conclusion

Computer vision turns "pixels → patterns → decisions." The field began with pixels, convolution, and handcrafted features such as those in Viola–Jones, and has moved to Convolutional Neural Networks that learn their own feature hierarchies. These techniques enable detection, tracking, and landmark prediction, but they also raise serious questions of bias, privacy, and consent, and they remain fragile in complex real-world environments. Responsible design, rigorous evaluation, and transparent deployment are essential.

Sunday, August 24, 2025
