[♪ INTRO]
If you'd taken an engineering job at Google before March 2017,
you would've had to sign an unusual waiver.
It was an agreement accepting that in the course of your work,
you might be exposed to adult content, including pornographic images.
That policy has since changed,
but the fact that it existed at all is a stark reminder that the internet doesn't censor itself.
Someone has to build the software that avoids showing naked people
to users who aren't supposed to see naked people.
Those engineers spend a lot of time designing, training, and testing algorithms
(the sets of rules a computer follows to accomplish a task) for filtering adult content.
Which might mean poring over images that in any other setting would
be totally not safe for work.
And building those algorithms is no easy task.
Even we humans have a hard time defining pornography, as US Supreme Court Justice Potter Stewart
famously admitted when all he could say was "I know it when I see it."
But at least we can usually agree on whether there's a naked person!
For computers though, even that is complicated.
When you look at a photo, you see people, and desks, and chairs,
but all the computer sees is this.
Little blocks of color, a million times over.
Where in this fuzzy mess of pixels do the body and chair even start and end?
And let's say you've figured out whether you're seeing distinct objects.
How can you tell what they are, and whether one of them is a nude human?
After all, even two pictures of the exact same thing can look very different.
For example, four photos of limbs might look very different,
when they're actually all from the same person.
With the differences in lighting, angles, and stuff in the way,
it's hard to tell that they're all Michelle Obama's arms.
Then sometimes you have the opposite problem,
when things that are totally different look weirdly similar.
Take dogs, for example.
You might think you know a dog when you see one,
but then along comes a viral meme and now you can't be sure whether that's a
puppy or whether it's a mop, a muffin, or a marshmallow.
Our brains have evolved to do a tremendous amount of this work subconsciously.
In fact, around 30% of your brain's cortex is dedicated just to vision!
But engineers are starting from zero.
Somehow, they have to get from a description of an image as a collection of tiny dots
of color to a higher-level description: textures, shapes, objects;
all the imprecise, big-picture stuff that tells us humans what we're looking at.
In the prehistoric era of computing, we're talking the 80s and 90s here,
the generally accepted approach was to think really hard about what
features of an image might make for decent high-level descriptions.
And then you'd design specialized algorithms to extract those features.
Some of the most popular features to look for were things like edges,
contiguous shapes, and so-called keypoints,
pixels whose neighbors stay roughly the same if the image is resized or otherwise tweaked.
These features let you summarize an image in a way that didn't change too much
with small tweaks like rotation or changes in lighting.
The hope was that if you had, say, an image of a kitten, the edge contours and keypoint
arrangements should be fairly similar to other pictures of kittens.
So you could use those features to build a specialized algorithm to decide
which images should be returned when a user searches for kittens.
In the case of finding naked people, this style of image analysis was used in a foundational
paper from 1996, a few years after porn websites started to go live on the web.
The paper was titled, appropriately enough, "Finding Naked People."
The first step the researchers took was to identify possible patches of skin.
This meant finding pixels containing yellows or browns, maybe with some reddish tones.
Skin also doesn't usually show much texture, at least in porn.
As the paper comments, "extremely hairy subjects are rare."
So any pixels showing skin shouldn't vary too much from the areas around them.
Next, if at least 30% of an image was marked as possibly skin,
the algorithm would try to piece those pixels together into body parts.
It would group straight-ish strips of skin color into longer segments.
Touching segments could be paired into limbs, and limbs and segments could
combine to form either spine-thigh groups or limb-limb groups.
Finally, the system would check which groups were geometrically possible given human physiology.
For example, it ruled out configurations that could only be formed by someone
lifting their leg up and flipping their knee backwards.
Any groups that weren't eliminated were assumed to be naked humans.
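To make that concrete, here's a rough Python sketch of the first stage of that kind of pipeline: flag skin-colored pixels, then check whether they cover enough of the frame. The color thresholds are illustrative stand-ins rather than the paper's exact numbers, and the limb-grouping and geometry checks are only hinted at in a comment.

```python
# A rough sketch, in the spirit of the 1996 pipeline, of its first two steps.
# The color bounds below are made-up approximations, not the paper's values.
import numpy as np

def skin_mask(image_rgb: np.ndarray) -> np.ndarray:
    """Flag pixels whose color sits in a rough 'skin tone' range."""
    r = image_rgb[..., 0].astype(float)
    g = image_rgb[..., 1].astype(float)
    b = image_rgb[..., 2].astype(float)
    # Yellow/brown/reddish hues: red strongest, blue weakest (illustrative thresholds).
    return (r > 95) & (g > 40) & (b > 20) & (r > g) & (g > b) & ((r - b) > 15)

def maybe_nude(image_rgb: np.ndarray, min_skin_fraction: float = 0.30) -> bool:
    """Step one of the old pipeline: is enough of the image skin-colored?"""
    mask = skin_mask(image_rgb)
    if mask.mean() < min_skin_fraction:
        return False  # not enough skin to bother with the geometry stage
    # The real system would now group skin strips into segments and limbs and
    # test whether their arrangement is anatomically possible; that whole
    # stage is omitted from this sketch.
    return True
```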
Methods like this worked OK, but they had some serious downsides.
First, those handcrafted rules were super brittle.
For example, humans don't usually position their trunk between their thighs,
but it can happen, especially in porn.
If it does, the rules might not recognize a nude person.
Similarly, there were a lot of finicky thresholds and settings to be fine-tuned.
Like, why is 30% the minimum amount of skin to accept?
Why not 15, or 35?
Also, to make improvements,
engineers would have to rethink all those custom-designed algorithms and how they interact.
And most importantly, it was entirely up to engineers' creativity to come up with high-level
descriptive features that were effective, and those features could only be as subtle
or detailed as the engineers were willing and able to code up.
So the criteria for nudity would be very rough and involve sometimes comically naive approximations.
I mean, looking for "blotches of skin color with plausible geometry" misses some of
the more, uhh, obvious features of naked bodies.
You know, like breasts and genitals.
Over the past few decades, though, a new method has started to take over the world
of image processing, including adult image detection.
It's called a convolutional neural network.
The core idea is that instead of manually defining which higher-level features are important,
you can build a system to figure that out for itself.
You show it thousands of training examples:
pictures labeled safe for work and not safe for work.
Then, you let it find recurring patterns, like spots of contrast or color,
and piece together how those patterns combine into bigger patterns like lines and edges.
Then it can learn even bigger patterns like skin textures and hair against skin.
And then it can start to recognize things like nipples and belly buttons
and guess whether it's seeing a naked person.
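Here's a minimal sketch of that idea in Python using PyTorch. The tiny architecture, the 64-by-64 image size, and the stand-in batch of random "images" are all made up for illustration, not any real production filter; but the loop is the core of it: guess, measure how wrong the guess was, and adjust.

```python
import torch
import torch.nn as nn

class TinyFilter(nn.Module):
    """A toy NSFW classifier: a stack of convolutional layers plus one output."""
    def __init__(self):
        super().__init__()
        # Convolutional layers that learn their own templates from the data.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(64, 1)  # one output: "how NSFW does this look?"

    def forward(self, x):
        return self.classifier(self.features(x))

model = TinyFilter()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()  # binary label: 1 = not safe for work, 0 = safe

# Stand-in for a real labeled dataset of 64x64 RGB images.
training_batches = [(torch.randn(8, 3, 64, 64), torch.randint(0, 2, (8, 1)).float())]

for images, labels in training_batches:
    logits = model(images)           # the network's best guess
    loss = loss_fn(logits, labels)   # how wrong was it?
    optimizer.zero_grad()
    loss.backward()                  # work out which settings to blame
    optimizer.step()                 # nudge them so the error shrinks
```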
The underlying technology here is the same thing that's driven many recent advances
in artificial intelligence: deep neural networks, or DNNs.
DNNs are loosely based on networks of brain cells,
in a "based on a true story" kind of way.
A DNN contains neurons in the same sense that a game of SimCity contains factories:
the software simulates a very crude version of the real-world thing.
The simulated neurons are arranged into virtual layers.
Each neuron gets a bunch of inputs, for example,
the colors of some pixels or the output from the previous layer.
Then, it performs a simple calculation based on some internal settings,
and passes the result on to the next layer of neurons.
The last layer's output is the network's best guess at an answer.
As the network sees each training example, it guesses what it's seeing.
If it guesses wrong, it twiddles the settings on each neuron so that
the error is less likely to happen next time.
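In code, a single simulated neuron really is just a few lines. This toy Python version uses made-up numbers and a deliberately simplified, perceptron-style update rule, but it shows the shape of the thing: a weighted sum, a squashing function, and a nudge to the settings when the guess is off.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=3)   # the neuron's internal settings
bias = 0.0
learning_rate = 0.1

def neuron(inputs: np.ndarray) -> float:
    """Weighted sum of the inputs, squashed to a 0-1 'confidence'."""
    return 1.0 / (1.0 + np.exp(-(inputs @ weights + bias)))

inputs = np.array([0.9, 0.2, 0.4])   # e.g. outputs from the previous layer
target = 1.0                         # what the guess should have been

guess = neuron(inputs)
error = target - guess
# Twiddle the settings so the same mistake is smaller next time
# (a simplified update rule, for brevity rather than realism).
weights += learning_rate * error * inputs
bias += learning_rate * error
```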
There are a lot of kinds of DNNs, but convolutional neural networks, or convnets,
are the type most often used for image processing.
In the first layer of a convnet, each neuron examines one small tile of the input image,
and outputs how strongly that tile matches a simple image template,
maybe a blob of pink, or a spot of contrast between light and dark.
That template is what gets learned when the neuron's
parameters are updated as the network is being trained.
And actually, there's a whole grid of these neurons, one for each tile in the image.
All the neurons in this grid update their settings together,
so they all learn to match the same template.
This is where the "convolutional" part of the name comes from,
applying the same template detector to each tile.
Now, within that first layer, there might actually be dozens of grids like this,
and each of them learns to match a different template.
So overall, what that whole first layer outputs is how
strongly each template matches at each location in the image.
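Here's a small Python sketch of that first layer: slide a couple of tiny templates over a grayscale image and record how strongly each one matches at every position. In a real convnet the templates are learned and there are far more of them; these two are hand-written just to show the mechanics.

```python
import numpy as np

def match_map(image: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Correlate a small template with every tile of a grayscale image."""
    th, tw = template.shape
    out_h = image.shape[0] - th + 1
    out_w = image.shape[1] - tw + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            tile = image[y:y + th, x:x + tw]
            out[y, x] = np.sum(tile * template)  # big value = strong match
    return out

# Two toy templates: a vertical light/dark edge and a uniform bright blob.
edge = np.array([[1.0, -1.0], [1.0, -1.0]])
blob = np.array([[0.5, 0.5], [0.5, 0.5]])

image = np.random.rand(8, 8)  # stand-in for one channel of a photo
first_layer_output = np.stack([match_map(image, edge), match_map(image, blob)])
print(first_layer_output.shape)  # (2 templates, 7 x 7 positions)
```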
The second layer is similar, but instead of looking for patterns directly in pixels,
it looks for patterns in the color blobs and
contrasts and all the other output of the first layer.
Because each second-layer neuron grabs inputs from a whole bunch of tiles from the first layer,
it gets information from a bigger swath of the original image.
The same thing continues up the hierarchy: each layer looks for patterns in the patterns
detected by previous layers, until finally the highest layers
end up looking for naked torsos and groins.
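Continuing the sketch above, the "patterns in patterns" idea is just the same operation applied to the first layer's output instead of the raw pixels. (Again, the template here is hand-written purely for illustration; a real network learns them and has many more.)

```python
# The second layer treats the first layer's match maps as its input "image",
# so its templates describe arrangements of edges and blobs, not raw pixels.
second_layer_template = np.array([[1.0, 1.0], [-1.0, -1.0]])
# Look for that pattern inside the edge-match map from the first layer:
# a pattern of patterns.
second_layer_output = match_map(first_layer_output[0], second_layer_template)
```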
Convnets have a lot of advantages over the older methods.
They can catch situations where there's, say, a penis close-up without engineers having
to guess ahead of time that nude images would likely contain lots of those close-ups
and then custom-build penis-detection algorithms.
Another plus of convnets is that everything is based on a sliding scale of similarity
rather than hard and fast rules: each neuron is asking itself,
"How closely does this patch of the image resemble a line or a foot or whatever?"
That means the network can be flexible about integrating multiple lines of uncertain evidence.
Convnets also change the paradigm for how to improve a system: if it's getting too
many false positives on swimsuits, just feed in loads of swimsuit photos as training
examples, and let the network figure out for itself how to distinguish them from true nudity.
In fact, you could even take a convnet designed for filtering porn
and retrain it to detect doge memes.
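Here's a sketch of what that retraining might look like, reusing the hypothetical TinyFilter model from the earlier sketch (imagine it has already been trained on NSFW data): freeze the convolutional layers that already know about blobs, edges, and textures, swap in a fresh final layer, and train only that layer on the new labels.

```python
import torch
import torch.nn as nn

model = TinyFilter()                 # the toy model sketched earlier, pre-trained
for param in model.features.parameters():
    param.requires_grad = False      # freeze the learned pattern detectors

model.classifier = nn.Linear(64, 1)  # fresh final layer for the new task

optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()     # 1 = doge meme, 0 = not a doge meme

# Stand-in for a labeled meme dataset.
doge_batches = [(torch.randn(8, 3, 64, 64), torch.randint(0, 2, (8, 1)).float())]
for images, labels in doge_batches:
    loss = loss_fn(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```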
Now, that doesn't mean a convnet is all-powerful.
There are still lots of task-sensitive choices the engineers have to make about how exactly
to structure the network, what size tiles to use, and so on.
So you might still have to do some surgery on your network if you're trying to catch,
say, tentacle-filled adult anime, which has very different characteristics from photos.
Convnets also don't help as much when you need social context
to recognize what makes an image not safe for work.
For example, it's a lot harder to build a detector for images of human trafficking,
because you can't just look at the pixels; you need a lot of background knowledge
about what's actually happening in the photo.
But for the most part, convnets get really good results.
They've revolutionized image processing,
from image search to vision for self-driving cars.
And flagging sexually explicit content is one of the most visible applications,
or maybe the least visible.
Because if adult image detection is working, most of the time you don't notice.
So the next time you're searching for pictures of nude tights and aren't bombarded with porn,
take a moment to appreciate the algorithms, and the engineers, that make it all possible.
Thanks for watching this episode of SciShow!
If you're interested in more deep dives into complex topics like this one,
you can check out our technology playlist over at youtube.com/scishow.
And don't forget to subscribe!
[♪ OUTRO]