Machine learning model family From Wikipedia, the free encyclopedia
Region-based Convolutional Neural Networks (R-CNN) are a family of machine learning models for computer vision, specifically for object detection and localization.[1] The original goal of R-CNN was to take an input image and produce a set of bounding boxes as output, where each bounding box encloses an object and is labelled with the object's category (e.g. car or pedestrian). In general, R-CNN architectures combine region proposals, such as those produced by selective search,[2] with feature maps output by a CNN.
R-CNN has been extended to other computer vision applications, such as tracking objects from a drone-mounted camera,[3] locating text in an image,[4] and enabling object detection in Google Lens.[5]
Mask R-CNN is also one of seven tasks in the MLPerf Training Benchmark, which is a competition to speed up the training of neural networks.[6]
The following covers some of the versions of R-CNN that have been developed.
For review articles, see [1][12].
Given an image (or an image-like feature map), selective search (also called Hierarchical Grouping) first segments the image by the algorithm in (Felzenszwalb and Huttenlocher, 2004),[13] then performs the following:[2]
Input: (colour) image
Output: Set of object location hypotheses L

Segment image into initial regions R = {r₁, ..., rₙ} using Felzenszwalb and Huttenlocher (2004)
Initialise similarity set S = ∅
foreach Neighbouring region pair (rᵢ, rⱼ) do
    Calculate similarity s(rᵢ, rⱼ)
    S = S ∪ s(rᵢ, rⱼ)
while S ≠ ∅ do
    Get highest similarity s(rᵢ, rⱼ) = max(S)
    Merge corresponding regions rₜ = rᵢ ∪ rⱼ
    Remove similarities regarding rᵢ: S = S \ s(rᵢ, r∗)
    Remove similarities regarding rⱼ: S = S \ s(r∗, rⱼ)
    Calculate similarity set Sₜ between rₜ and its neighbours
    S = S ∪ Sₜ
    R = R ∪ rₜ
Extract object location boxes L from all regions in R
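The grouping loop above can be illustrated with a minimal Python sketch. It is not the published implementation. It assumes the initial segmentation is already given as sets of pixel coordinates, and it uses a single size-based similarity (smaller pairs merge first) as a stand-in for the similarity measure. The function names `hierarchical_grouping`, `size_similarity`, `are_neighbours` and `bounding_box` are hypothetical.

```python
from itertools import combinations


def bounding_box(pixels):
    """Return (x_min, y_min, x_max, y_max) for a set of (x, y) pixels."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return (min(xs), min(ys), max(xs), max(ys))


def are_neighbours(a, b):
    """Treat two regions as neighbours if any of their pixels are 4-adjacent."""
    return any((x + dx, y + dy) in b
               for x, y in a
               for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)))


def size_similarity(a, b, image_area):
    """Assumed similarity: higher when the two regions are small."""
    return 1.0 - (len(a) + len(b)) / image_area


def hierarchical_grouping(initial_regions, image_area):
    """Greedily merge the most similar neighbouring regions.

    initial_regions stands in for the initial segmentation: a list of
    sets of (x, y) pixels. Returns one bounding box per region in R,
    including every merged region created along the way.
    """
    regions = {i: frozenset(r) for i, r in enumerate(initial_regions)}
    active = set(regions)  # regions that have not been merged yet
    # Similarity set S, keyed by a pair of region ids.
    sims = {}
    for i, j in combinations(sorted(active), 2):
        if are_neighbours(regions[i], regions[j]):
            sims[(i, j)] = size_similarity(regions[i], regions[j], image_area)

    next_id = len(regions)
    while sims:
        i, j = max(sims, key=sims.get)          # highest similarity
        merged = regions[i] | regions[j]        # r_t = r_i ∪ r_j
        # Remove every similarity that involves r_i or r_j.
        sims = {pair: s for pair, s in sims.items()
                if i not in pair and j not in pair}
        active -= {i, j}
        # Similarities between r_t and its remaining neighbours.
        for k in active:
            if are_neighbours(merged, regions[k]):
                sims[(k, next_id)] = size_similarity(
                    merged, regions[k], image_area)
        regions[next_id] = merged               # R = R ∪ r_t
        active.add(next_id)
        next_id += 1

    return [bounding_box(r) for r in regions.values()]
```

Because every merged region is kept in R, the returned boxes span all scales of the hierarchy, from the initial segments up to the whole connected area.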
Given an input image, R-CNN begins by applying selective search to extract regions of interest (ROIs), where each ROI is a rectangle that may represent the boundary of an object in the image. Depending on the scenario, there may be as many as two thousand ROIs. Each ROI is then fed through a neural network to produce output features. For each ROI's output features, an ensemble of support-vector machine classifiers is used to determine what type of object (if any) is contained within the ROI.[7]
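The per-ROI flow described above can be sketched as follows. This is an illustrative outline rather than the original model. The feature extractor `toy_features` is a hypothetical stand-in for the neural network, and the class names, weights and biases are made up for the example. The classifiers are assumed to be linear SVMs that score features as w·x + b.

```python
def crop(image, box):
    """Cut the rectangle (x0, y0, x1, y1) out of a 2D list; x1, y1 exclusive."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def toy_features(patch):
    """Hypothetical stand-in for the CNN: mean and max intensity."""
    values = [v for row in patch for v in row]
    return [sum(values) / len(values), max(values)]


def svm_score(weights, bias, features):
    """Decision value of a linear SVM: w . x + b."""
    return sum(w * f for w, f in zip(weights, features)) + bias


def classify_rois(image, rois, svms, extract=toy_features):
    """Run every ROI through the extractor and one SVM per class.

    svms maps a class label to (weights, bias). An ROI is reported only
    if its best-scoring class has a positive decision value; otherwise
    it is treated as containing no object.
    """
    detections = []
    for box in rois:
        features = extract(crop(image, box))  # computed separately per ROI
        scores = {label: svm_score(w, b, features)
                  for label, (w, b) in svms.items()}
        label, score = max(scores.items(), key=lambda kv: kv[1])
        if score > 0:
            detections.append((box, label, score))
    return detections
```

Note that the extractor runs once per ROI, so the cost grows linearly with the number of proposals.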
While the original R-CNN independently computed the neural network features on each of as many as two thousand regions of interest, Fast R-CNN runs the neural network once on the whole image.[8]
At the end of the network is a ROIPooling module, which slices out each ROI from the network's output tensor, reshapes it, and classifies it. As in the original R-CNN, the Fast R-CNN uses selective search to generate its region proposals.
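As a rough illustration of the slicing and reshaping step, the sketch below max-pools one ROI of a single-channel feature map into a fixed-size grid. It assumes the ROI is already expressed in integer feature-map cells with an exclusive lower-right corner, and it omits the classification layers. The name `roi_max_pool` is hypothetical.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the ROI (x0, y0, x1, y1) into an out_h x out_w grid.

    feature_map is a 2D list (one channel). The ROI is split into
    roughly equal bins, and each bin contributes its maximum value, so
    ROIs of any size yield an output of the same shape.
    """
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Bin rows: floor of the start, ceiling of the end.
        ys = y0 + (i * h) // out_h
        ye = y0 + -(-((i + 1) * h) // out_h)
        row = []
        for j in range(out_w):
            xs = x0 + (j * w) // out_w
            xe = x0 + -(-((j + 1) * w) // out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled
```

The feature map is computed once for the whole image, and only this inexpensive pooling step is repeated per ROI.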
While Fast R-CNN used selective search to generate ROIs, Faster R-CNN integrates the ROI generation into the neural network itself.[9]
While previous versions of R-CNN focused on object detection, Mask R-CNN adds instance segmentation. Mask R-CNN also replaced ROIPooling with a new method called ROIAlign, which can represent fractions of a pixel.[10]
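One way to handle fractional pixel positions is bilinear interpolation, sketched below for a single-channel feature map. This is a simplified illustration and not the Mask R-CNN implementation. It assumes integer coordinates fall on cell centres and takes only one sample per output bin, at the bin centre. The names `bilinear` and `roi_align` are hypothetical.

```python
import math


def bilinear(feature_map, x, y):
    """Interpolate a 2D list at a fractional (x, y) position."""
    h, w = len(feature_map), len(feature_map[0])
    # Clamp to the valid range so the four neighbours always exist.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0  # fractional offsets inside the cell
    top = (1 - ax) * feature_map[y0][x0] + ax * feature_map[y0][x1]
    bottom = (1 - ax) * feature_map[y1][x0] + ax * feature_map[y1][x1]
    return (1 - ay) * top + ay * bottom


def roi_align(feature_map, roi, out_h, out_w):
    """Sample an ROI with float corners into an out_h x out_w grid.

    Unlike pooling over integer bins, the bin edges are never rounded;
    each output value is read at the exact centre of its bin.
    """
    x0, y0, x1, y1 = roi
    bin_h = (y1 - y0) / out_h
    bin_w = (x1 - x0) / out_w
    return [[bilinear(feature_map,
                      x0 + (j + 0.5) * bin_w,
                      y0 + (i + 0.5) * bin_h)
             for j in range(out_w)]
            for i in range(out_h)]
```

Because no coordinate is rounded, small shifts of the ROI produce small changes in the output, which matters when predicting pixel-level masks.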