How do you train a deep neural network to recognize images?

Data collection

In addition to classic approaches, synthetic data is a valuable way to quickly add variability, volume, and balance to a dataset. It accelerates object recognition projects by reducing the need for extensive, time-consuming manual labeling. With synthetic data, we create 3D models of objects, place them in a photorealistic virtual environment that mirrors reality, and generate as many images as necessary from different angles, under different lighting conditions, and with different camera and lens combinations. Because annotations are derived directly from the rendered scene, they are produced automatically and with 100% accuracy, which benefits the performance of machine learning models. In our experience, this is especially useful for computer vision in retail, where product packaging changes constantly.

Data processing

In AI systems, accuracy is the Holy Grail, and proper data processing is the map leading to it. Every step can make a world of difference. Consider the detection of small objects within a frame: if you input the entire frame and it gets downscaled during processing, small objects may shrink to a few pixels and become hard to detect. This is where solutions like the SAHI (Slicing Aided Hyper Inference) library come into play. Simply dividing the frame into smaller tiles can significantly improve the outcome. Inference is then performed on each tile at its original resolution, so objects retain their dimensions and valuable visual features aren’t lost.
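
The tiling idea can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not the SAHI API: the function names `tile_boxes` and `to_frame_coords`, the tile size, and the overlap ratio are all assumptions chosen for the example.

```python
def tile_boxes(width, height, tile, overlap=0.25):
    """Return (x0, y0, x1, y1) windows that cover a width x height frame.

    Neighbouring tiles overlap so that an object cut by one tile border
    is still fully visible in an adjacent tile.
    """
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, int(tile * (1 - overlap)))

    def starts(size):
        # A frame smaller than one tile needs a single window.
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, step))
        positions.append(size - tile)  # last tile sits flush with the edge
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


def to_frame_coords(box, tile_box):
    """Shift a detection box from tile-local to full-frame coordinates."""
    off_x, off_y = tile_box[0], tile_box[1]
    x0, y0, x1, y1 = box
    return (x0 + off_x, y0 + off_y, x1 + off_x, y1 + off_y)
```

For a 1920x1080 frame with 640-pixel tiles and 25% overlap, `tile_boxes(1920, 1080, 640)` yields 8 tiles. Each tile is passed to the detector at its original resolution, and the resulting boxes are mapped back with `to_frame_coords`. Duplicate detections in the overlapping regions still have to be merged, for example with non-maximum suppression.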

Model design

The third step is to design your DNN model to match your task and data. You can use existing Convolutional Neural Network (CNN) architectures such as Residual Networks (ResNets) or DenseNets, or you can create your own custom layers and connections. You need to decide on the number of layers, neurons, and filters, the kernel sizes, and the activation functions, as well as how to initialize your weights and biases. You can use frameworks like Keras, PyTorch, or TensorFlow to design your model.
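
Before committing to an architecture, it can help to estimate how the choice of filters, kernels, and strides affects feature-map size and parameter count. The sketch below uses the standard convolution arithmetic in plain Python. The three-layer stack in `EXAMPLE_LAYERS` is a made-up example, not a recommended design.

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv2d_params(in_ch, out_ch, kernel, bias=True):
    """Trainable parameters of a 2D convolution with square kernels."""
    return out_ch * (in_ch * kernel * kernel + (1 if bias else 0))


# Hypothetical stack: (out_channels, kernel, stride, padding) per layer.
EXAMPLE_LAYERS = [(16, 3, 1, 1), (32, 3, 2, 1), (64, 3, 2, 1)]


def summarize(input_size, in_ch, layers):
    """Return (out_channels, feature_map_size, params) for each layer."""
    rows, size, channels = [], input_size, in_ch
    for out_ch, kernel, stride, padding in layers:
        size = conv2d_out(size, kernel, stride, padding)
        rows.append((out_ch, size, conv2d_params(channels, out_ch, kernel)))
        channels = out_ch
    return rows
```

For a 224x224 RGB input, `summarize(224, 3, EXAMPLE_LAYERS)` reports feature maps of 224, 112, and 56 pixels per side, with 448, 4640, and 18496 parameters respectively.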

Model training

It’s important to understand the key hyperparameters involved in model training and exactly how they affect the computer vision problem you are solving. Generally, the optimizer and learning rate can start from their default values, the batch size and loss function are candidates for tuning, and the augmentation pipeline should be carefully designed for each task. It is good practice to track a wide range of evaluation factors during training: visualizations, metrics, loss values, stability measurements, etc. Decisions about the training process should be based on a complete analysis of these evaluation factors and on custom computer vision quality assurance procedures.
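
As a rough illustration of tracking several evaluation factors per epoch, here is a minimal, framework-independent sketch. The `TrainingLog` class, its method names, and the particular stability measure (coefficient of variation over a recent window) are assumptions made for this example. In practice an experiment tracker would usually fill this role.

```python
from statistics import mean, pstdev


class TrainingLog:
    """Collects named per-epoch values and derives simple summaries."""

    def __init__(self, window=5):
        self.window = window
        self.history = {}

    def log(self, name, value):
        self.history.setdefault(name, []).append(float(value))

    def smoothed(self, name):
        """Mean of the most recent `window` values."""
        return mean(self.history[name][-self.window:])

    def instability(self, name):
        """Std / |mean| over the recent window; higher means noisier."""
        recent = self.history[name][-self.window:]
        centre = mean(recent)
        return pstdev(recent) / abs(centre) if centre else float("inf")

    def best_epoch(self, name, mode="min"):
        """Zero-based index of the epoch with the best value."""
        values = self.history[name]
        pick = min if mode == "min" else max
        return values.index(pick(values))
```

A typical use is to call `log("val_loss", ...)` and `log("val_recall", ...)` once per epoch, then look at `best_epoch` and `instability` together rather than at a single number when deciding whether to stop or to change a hyperparameter.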

Model evaluation

It starts with a simple question: “What’s the purpose of this specific model?” With the answer in mind, choose the evaluation metrics accordingly. Every metric is designed to measure a particular aspect of the model’s behavior. Accuracy answers “What fraction of all predictions is correct?”, but it can be misleading when classes are imbalanced or when we need to detect every case of one class. That’s when precision and recall come into play: for a given class, precision measures how many of the predicted detections are correct, and recall measures how many of the actual cases were found. Together they give a more detailed picture of the model’s tendencies. Also consider using IoU (Intersection over Union) to measure how well predicted boxes or masks overlap the ground truth, and mAP (mean average precision), which is widely used to evaluate object detection and instance segmentation models.
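
The difference between these metrics is easiest to see on a small example. The following plain-Python sketch computes accuracy, precision, and recall for a binary problem, plus IoU for axis-aligned boxes given as `(x0, y0, x1, y1)`. The function names and the convention of returning 0.0 when a ratio is undefined are choices made for this illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred, positive=1):
    tp, fp, _, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fp) if tp + fp else 0.0  # undefined case mapped to 0.0


def recall(y_true, y_pred, positive=1):
    tp, _, fn, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn) if tp + fn else 0.0  # undefined case mapped to 0.0


def iou(box_a, box_b):
    """Intersection over Union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union else 0.0
```

With 5 positive and 95 negative samples, a model that always predicts the negative class reaches an accuracy of 0.95 while its recall for the positive class is 0.0, which is exactly the failure mode that accuracy alone hides.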

Model deployment

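For edge processing (e.g. with NVIDIA Jetsons, BYO, or custom hardware), it’s common to employ model quantization either during or after training. It involves converting high-precision model weights (typically 32-bit floating point) into lower-precision representations such as 16-bit floats or 8-bit integers, reducing memory and computational requirements, usually with only a small loss of accuracy. Models are also commonly exported to a portable format like ONNX (Open Neural Network Exchange) so they can be run by an optimized inference runtime.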
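
To make the idea concrete, here is a minimal sketch of symmetric per-tensor 8-bit quantization in plain Python. It is a simplified illustration under stated assumptions (one scale for the whole tensor, weights only, no calibration data), not the scheme used by any particular toolchain. The function names are made up for the example.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] with a single scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer values."""
    return [q * scale for q in quantized]
```

Each weight is stored in one byte instead of four, and the rounding error per weight is at most half of `scale`. Whether that error is acceptable depends on the model, so accuracy should be re-measured after quantization.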

Other useful tips

  • Actively use cutting-edge Computer Vision model demos available on the web to generate and validate hypotheses. Notable examples include the Segment Anything, LaMa inpainting, and Stable Diffusion web demos. Just upload your data and give it a quick try. If the results are satisfactory, you can set up the models locally and conduct more extensive experimentation and benchmarking.
  • Visual segmentation is a very powerful tool that can be applied in many applications. Its advantages over a simple bounding box are substantial. For example, a segmentation mask combined with accurate binary postprocessing gives powerful geometrical insights and a better understanding of scene scale. The actual shape of the segment carries information a box cannot, such as area, convexity, integrity, and orientation. With the release of the Segment Anything model, which is capable of large-scale zero-shot segmentation, the use of segmentation in computer vision tasks has increased dramatically. Zero-shot segmentation is also a very powerful tool for data annotation.
  • Actively use traditional Computer Vision techniques in your image processing pipeline. Morphological processing applied to a binary image provides a range of useful tools: hole filling, largest-component selection, edge smoothing, skeletonization, convex hull computation, etc. Carefully combine and chain binary and morphological transformations to achieve the result you want. More advanced usage includes altering the morphological structuring elements. Other useful traditional Computer Vision tools include line and curve detection, background subtraction, contour segmentation, pattern matching, etc. The great advantages of traditional methods are flexibility, explainability, and low resource cost.
  • When working with video, it may be beneficial to track specific points through time and postprocess the resulting signal. Keypoint detection is often affected by motion blur, defocus blur, occlusions, and changes in appearance and illumination, and thus produces quite noisy location signals. Apply signal smoothing and time-series analysis to extract a refined motion model, which can then be used to obtain distance, speed, rotation, and orientation.
  • Do not rely entirely on deep Computer Vision. Actively use geometry, 3D geometry, camera models, and traditional algorithms. The best solution lies in the proper combination of the above. Also, it’s not always necessary to train a deep Computer Vision model on your own data. In most cases, it’s enough to use available pre-trained models with proper integration into the pipeline.
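
As an illustration of the video tip above, the following sketch smooths a noisy keypoint track with an exponential moving average and then derives speed from the smoothed positions. The choice of an exponential moving average, the `alpha` value, and the function names are assumptions for this example. A Kalman filter or another time-series method may suit a real pipeline better.

```python
import math


def ema_smooth(points, alpha=0.3):
    """Exponentially smooth a list of (x, y) keypoint positions.

    Smaller alpha gives stronger smoothing but more lag.
    """
    if not points:
        return []
    smoothed = [points[0]]
    for x, y in points[1:]:
        prev_x, prev_y = smoothed[-1]
        smoothed.append((alpha * x + (1 - alpha) * prev_x,
                         alpha * y + (1 - alpha) * prev_y))
    return smoothed


def speeds(points, fps):
    """Speed in pixels per second between consecutive positions."""
    return [math.hypot(x1 - x0, y1 - y0) * fps
            for (x0, y0), (x1, y1) in zip(points, points[1:])]
```

Computing `speeds(ema_smooth(track), fps)` rather than `speeds(track, fps)` keeps detection jitter from being read as motion. Converting pixels per second into real-world units additionally requires a camera model or a known scene scale.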