image retraining

modern image recognition models have millions of parameters. training them from scratch requires a lot of labeled training data and a lot of computing power (hundreds of GPU-hours or more). transfer learning is a technique that shortcuts much of this by taking a piece of a model that has already been trained on a related task and reusing it in a new model. in this tutorial, we will reuse the feature extraction capabilities from powerful image classifiers trained on ImageNet and simply train a new classification layer on top. for more information on the approach, see the DeCAF paper.
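
as a rough sketch of what this looks like in code (assuming TensorFlow 2 with TensorFlow Hub, and a MobileNet V2 feature-vector module standing in for whichever pretrained classifier you reuse), you freeze the pretrained part and stack a single new classification layer on top:

    import tensorflow as tf
    import tensorflow_hub as hub

    # hypothetical choices: a MobileNet V2 feature-vector module and five classes
    FEATURE_VECTOR_URL = "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/4"
    NUM_CLASSES = 5

    model = tf.keras.Sequential([
        # pretrained ImageNet feature extractor, kept frozen
        hub.KerasLayer(FEATURE_VECTOR_URL, trainable=False,
                       input_shape=(224, 224, 3)),
        # the only part we actually train: a new classification layer for our classes
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])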


though it's not as good as training the full model, this is surprisingly effective for many applications, works with moderate amounts of training data (thousands, not millions, of labeled images), and can be run in as little as thirty minutes on a laptop without a GPU.


before you start any training, you'll need a set of images to teach the network about the new classes you want to recognize.


Bottleneck is an informal term we often use for the layer just before the final output layer that actually does the classification. (TensorFlow Hub calls this an "image feature vector".) this penultimate layer has been trained to output a set of values that's good enough for the classifier to use to distinguish between all the classes it's been asked to recognize. that means it has to be a meaningful and compact summary of the images, since it has to contain enough information for the classifier to make a good choice in a very small set of values. the reason our final layer retraining can work on new classes is that it turns out the kind of information needed to distinguish between all 1,000 classes in ImageNet is often also useful to distinguish between new kinds of objects.
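
to make that concrete, here is a small sketch (reusing the same hypothetical MobileNet V2 module, with a made-up file name) that pushes one image through the feature extractor and looks at the bottleneck it produces:

    import numpy as np
    import tensorflow as tf
    import tensorflow_hub as hub

    feature_extractor = hub.KerasLayer(
        "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/4",
        trainable=False)

    # load one image and scale it to the size and range the module expects
    img = tf.keras.utils.load_img("daisy_example.jpg", target_size=(224, 224))
    img = tf.keras.utils.img_to_array(img) / 255.0

    bottleneck = feature_extractor(img[np.newaxis, ...])
    print(bottleneck.shape)   # (1, 1280): a compact summary of the whole image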


because every image is reused multiple times during training and calculating each bottleneck takes a significant amount of time, it speeds things up to cache these bottleneck values on disk so they don't have to be repeatedly recalculated.
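
one way to picture the caching is a small helper that only computes a bottleneck the first time it's asked for; this is just a sketch, not the script's exact file layout, and `feature_extractor` and `load_and_preprocess` stand in for whatever produces and preprocesses the bottlenecks:

    import os
    import numpy as np

    def get_or_create_bottleneck(image_path, cache_dir, feature_extractor, load_and_preprocess):
        # reuse a cached bottleneck if we've already computed one for this image
        os.makedirs(cache_dir, exist_ok=True)
        cache_path = os.path.join(cache_dir, os.path.basename(image_path) + ".npy")
        if os.path.exists(cache_path):
            return np.load(cache_path)
        # first time we see this image: run it through the feature extractor once
        image = load_and_preprocess(image_path)   # shape (224, 224, 3), values in [0, 1]
        bottleneck = feature_extractor(image[np.newaxis, ...]).numpy()[0]
        np.save(cache_path, bottleneck)           # so later steps can skip this work
        return bottleneck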


once the bottlenecks are complete, the actual training of the top layer of the network begins. you'll see a series of step outputs, each one showing training accuracy, validation accuracy, and the cross entropy. the training accuracy shows what percent of the images used in the current training batch were labeled with the correct class. the validation accuracy is the percentage of correctly-labeled images in a randomly-selected group from a different set. the key difference is that the training accuracy is based on images that the network has been able to learn from, so the network can overfit to the noise in the training data. a true measure of the performance of the network is its performance on a data set not contained in the training data, which is what the validation accuracy captures. if the training accuracy is high but the validation accuracy remains low, that means the network is overfitting and memorizing particular features in the training images that aren't helpful more generally. cross entropy is a loss function that gives a glimpse into how well the learning process is progressing. the training's objective is to make the loss as small as possible, so you can tell whether the learning is working by keeping an eye on whether the loss keeps trending downwards, ignoring the short-term noise.
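
if you're curious roughly how those numbers come about, here's a sketch of computing accuracy and cross entropy for one batch, assuming `classifier` is the new final layer taking cached bottlenecks and `labels` are integer class ids:

    import tensorflow as tf

    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

    def batch_metrics(classifier, bottlenecks, labels):
        # accuracy: fraction of images whose highest-scoring class matches the label
        predictions = classifier(bottlenecks, training=False)
        predicted_classes = tf.argmax(predictions, axis=1)
        accuracy = tf.reduce_mean(
            tf.cast(tf.equal(predicted_classes, tf.cast(labels, tf.int64)), tf.float32))
        # cross entropy: the loss the training process is trying to push downwards
        cross_entropy = loss_fn(labels, predictions)
        return float(accuracy), float(cross_entropy)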


by default this script will run 4000 training steps. each step chooses ten images at random from the training set, finds their bottlenecks from the cache, and feeds them into the final layer to get predictions. those predictions are then compared against the actual labels to update the final layer's weights through the back-propagation process.
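
put together, one training step might look something like the sketch below; the names `classifier` (the new final softmax layer), `cached_bottlenecks`, `labels`, and `train_paths` are assumptions standing in for the pieces described above, not the script's actual variables:

    import random
    import numpy as np
    import tensorflow as tf

    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

    for step in range(4000):                            # 4000 steps by default
        batch_paths = random.sample(train_paths, 10)    # ten random training images
        x = np.stack([cached_bottlenecks[p] for p in batch_paths])
        y = np.array([labels[p] for p in batch_paths])
        with tf.GradientTape() as tape:
            predictions = classifier(x, training=True)  # only the final layer runs here
            loss = loss_fn(y, predictions)
        # back-propagation updates just the final layer's weights
        grads = tape.gradient(loss, classifier.trainable_variables)
        optimizer.apply_gradients(zip(grads, classifier.trainable_variables))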


the first place to start is by looking at the images you've gathered, since the most common issues we see with training come from the data that's being fed in.


for training to work well, you should gather at least a hundred photos of each kind of object you want to recognize. the more you can gather, the better the accuracy of your trained model is likely to be. you also need to make sure that the photos are a good representation of what your application will actually encounter. for example, if you take all your photos indoors against a blank wall and your users are trying to recognize objects outdoors, you probably won't see good results when you deploy.


another pitfall to avoid is that the learning process will pick up on anything that the labeled images have in common with each other, and if you're not careful that might be something that's not useful. for example, if you photograph one kind of object in a blue room and another in a green one, the model will end up basing its prediction on the background color, not the features of the object you actually care about. to avoid this, try to take pictures in as wide a variety of situations as you can, at different times, and with different devices.


you may also want to think about the categories you use. it might be worth splitting big categories that cover a lot of different physical forms into smaller ones that are more visually distinct. it's also worth thinking about whether you have a 'closed world' or an 'open world' problem. in a closed world, the only things you'll ever be asked to categorize are the classes of object you know about. this might apply to a plant recognition app where you know the user is likely to be taking a picture of a flower, so all you have to do is decide which species. by contrast, a roaming robot might see all sorts of different things through its camera as it wanders around the world. in that case you'd want the classifier to report if it wasn't sure what it was seeing. this can be hard to do well, but often if you collect a large number of typical 'background' photos with no relevant objects in them, you can add them to an extra 'unknown' class in your image folders.


it's also worth checking to make sure that all of your images are labeled correctly.


the rate of improvement in the accuracy slows the longer you train for, and at some point will stop altogether (or even go down due to overfitting), but you can experiment to see what works best for your model.


a common way of improving the results of image training is by deforming, cropping, or brightening the training inputs in random ways. this has the advantage of expanding the effective size of the training data thanks to all the possible variations of the same images, and tends to help the network learn to cope with all the distortions that will occur in real-life uses of the classifier. the biggest disadvantage of enabling these distortions in our script is that the bottleneck caching is no longer useful, since input images are never reused exactly. this means the training process takes a lot longer (many hours), so it's recommended you try this as a way of polishing your model only after you have one that you're reasonably happy with.
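
as an illustration of the kind of distortions involved (the crop size and brightness range here are made up, not the script's defaults), a per-image augmentation function might look like this:

    import tensorflow as tf

    def distort_image(image):
        # image is assumed to be a float tensor in [0, 1] with shape (224, 224, 3)
        image = tf.image.random_flip_left_right(image)            # mirror horizontally
        image = tf.image.random_brightness(image, max_delta=0.1)  # brighten or darken a little
        # zoom in slightly by enlarging and then taking a random 224x224 crop
        image = tf.image.resize(image, (250, 250))
        image = tf.image.random_crop(image, size=(224, 224, 3))
        return tf.clip_by_value(image, 0.0, 1.0)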


for example, flipping images left to right wouldn't be a good idea if you were trying to recognize letters, since flipping them destroys their meaning.


the largest is usually the training set, which is all the images fed into the network during training, with the results used to update the model's weights. you might wonder why we don't use all the images for training. a big potential problem when we're doing machine learning is that our model may just be memorizing irrelevant details of the training images to come up with the right answers. for example, you could imagine a network remembering a pattern in the background of each photo it was shown, and using that to match labels with objects. it could produce good results on all the images it's seen before during training, but then fail on new images because it hasn't learned the general characteristics of the objects, just memorized unimportant details of the training images.
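
as a rough sketch of such a split (an illustrative 80/10/10 random partition; the proportions and the shuffling are assumptions, not necessarily how the script divides its sets):

    import random

    def split_image_paths(image_paths, validation_fraction=0.1, test_fraction=0.1):
        # shuffle once, then carve out validation and test sets; the rest is training
        paths = list(image_paths)
        random.shuffle(paths)
        n_val = int(len(paths) * validation_fraction)
        n_test = int(len(paths) * test_fraction)
        validation = paths[:n_val]
        test = paths[n_val:n_val + n_test]
        training = paths[n_val + n_test:]
        return training, validation, test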


once training is complete, you may find it insightful to examine misclassified images in the test set. this may help you get a feeling for which types of images were most confusing for the model, and which categories were most difficult to distinguish. for instance, you might discover that some subtype of a particular category, or some unusual photo angle, is particularly difficult to identify, which may encourage you to add more training images of that subtype. oftentimes, examining misclassified images can also point to errors in the input data set, such as mislabeled, low-quality, or ambiguous images. however, one should generally avoid point-fixing individual errors in the test set, since they are likely to merely reflect more general problems in the much larger training set.
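
a small sketch of how you might pull those out, assuming parallel arrays of cached test-set bottlenecks, integer labels, and file paths, plus a class_names list (all hypothetical names):

    def find_misclassified(classifier, test_bottlenecks, test_labels, test_paths, class_names):
        # compare the model's top prediction with the true label for every test image
        predictions = classifier(test_bottlenecks, training=False).numpy().argmax(axis=1)
        mistakes = []
        for path, label, predicted in zip(test_paths, test_labels, predictions):
            if predicted != label:
                mistakes.append((path, class_names[label], class_names[predicted]))
        return mistakes   # e.g. [("rose_012.jpg", "roses", "tulips"), ...]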