(upbeat music)
- Good afternoon.
(audience mutters)
Welcome to the Allen School Colloquium.
I am delighted this afternoon to introduce to you
Jacob Eisenstein who is an assistant professor
at Georgia Tech.
Jacob is a leader in applying machine learning
to the area of computational social science
and what that is, he'll tell us shortly.
This is a really exciting area
and Jacob has been a real leader in this space.
I believe he's taught the first course
on computational journalism.
I'm not sure if we're gonna hear about that this afternoon,
but you can ask him later offline.
Before he went to Georgia Tech,
Jacob was at Carnegie Mellon University as a postdoc.
Before that, he did his PhD at MIT,
and before that he was an undergrad
in the symbolic systems program at Stanford.
We are really happy to have him here today.
Let's give him a warm welcome.
(applause)
- Okay, thanks a lot Noah.
Yeah it's exciting to be here, I was here in 2015 for the
Jelinek summer workshop on language technology
for six weeks, that was a blast.
I had a good time getting to meet some of you this morning and
look forward to meeting more of you this evening and tomorrow.
That didn't work.
Okay, so.
Today, these are two headlines from the last month or two, and I think they're symptomatic of a trend: tech companies see themselves beset by challenges that are not purely computational challenges. These are essentially societal challenges, yet they're existential challenges for some of these companies.
I think the reason this is happening
is because computation has really succeeded
and computation has made itself ubiquitous
in our social lives and in our political lives
and this is what has led to new problems like
echo chambers and bots and viral hoaxes and hate speech.
As well as bringing new challenges,
the fact that computation is now everywhere in our lives
also offers new opportunities.
Opportunities to measure and understand social phenomena and to address social challenges at scale; and perhaps leveraging these opportunities contains the seed of solving some of these challenges as well.
This is the setting in which I'd like to
sort of situate computational social science
as an emerging discipline
which applies computational methods
to research questions and to theories
that come from social science.
There are many ways you could imagine computation.
You know, computation plays a role
in all kinds of research endeavors now
and there are many ways you could imagine
computation playing a role in social science.
There's a distinction that I find helpful; it comes from the Princeton sociologist Matt Salganik, who distinguishes between what he calls ready-made approaches and custom builds. A ready-made approach is: you write some high-quality software and you sort of throw it over the fence, and social scientists, who are increasingly computationally savvy themselves, pick it up,
and so you write a piece of software like AllenNLP
and that allows social scientists to do text analysis
that's relevant to their data.
Or you write a piece of software like gephi
and that allows social scientists to do
social network analysis that's relevant to their research.
This is sort of one approach
to computational social science.
It's not the one that I take.
The approach that I take is the alternative,
which Salganik calls a custom build.
This is sort of a one-off solution, a bespoke computational solution
to a social science problem.
Ideally, this is something that sort of
existing computational methods could not do before,
and so you have to innovate in computer science
in order to get the result and the method that you desire
to achieve your ends in social science.
I think these two approaches to computational social science are of course complementary.
The hope is that you build enough of these custom solutions,
and maybe people abstract them
and are able to integrate them into something
that's more of a ready made tool
for the next generation of social scientists in the future.
The projects that I'll talk about today,
all three are sort of in this custom build vein.
These are things that, before we did this research,
they could not be accomplished
with existing computational techniques
and we had to innovate in computer science to do them.
I should also say that each of these projects is publishable in computer science, and in many cases also publishable in a social science discipline as well.
Three pieces to the talk, in terms of the concrete projects.
I tried to pick verbs to start each of these bullet points that tell you something about the roles that computation can play in social science.
Exploring data, operationalizing constructs,
and I'll talk more about what that means
when I get to that part, and then measurement,
particularly measurement of causal effects.
Let's jump into the first piece,
exploring the construction of
what I'm calling social meaning in networks.
Social meaning, that's the way that we use language to create and affect
the social relationships that we participate in.
I think a great example of that is the use of address terms in language,
so an address term is something like
Mr. Lebowski or The Dude, and the joke with this movie is that he's this laid-back guy, but he's insistent that everyone he interacts with interacts with him on his terms, and his terms are that you can only call him The Dude, because that's the sort of social relationship that he wants to have with people.
From a linguistic perspective,
you could ask how address terms like Mr. Lebowski and Dude,
how do they create social meaning?
But the social meaning that we create
with the people that we talk to, it's not purely one relationship at a time, one dyad at a time.
Actually, each relationship that we have,
well that's situated in a larger network that we have to participate in
and it's not the case that we can just make
a series of independent decisions about how we wanna relate to people.
There has to be some kind of holistic structure
that emerges from that as well.
In this sense, what's needed here is a model that isn't just on the text side, where you could do something with a topic model, for example, and hope to get topics that relate to social meaning in some way, and isn't just on the network side, where you could look at the network and say,
"Oh you know this particular dyad has a lot
"of mutual friends, it's a strong tie.
"This particular dyad doesn't have any mutual friends,
"it's a weak tie."
But what we really want is a model
that integrates both of those two modalities
and makes some kind of inference from the two of them
that you couldn't make from either of them individually.
Okay so I'll try to formalize this a little bit more.
The setting here is that we observe some network structure,
so we know who talks to whom,
and so this network structure is the undirected graph.
Then on each edge in this network,
I have a vector of counts of linguistic features.
The counts are, in this research, counts of each possible way of addressing somebody.
We have a pre-processing step for building that lexicon
that I won't have time to talk about,
but you could ask me about that.
On each one of these dyads, I have a count of the number of times each possible reference term has been used.
Then the idea is that these counts
come from some kind of latent variable model
where each of these dyads has some kind of latent relationship,
that's why I'm using the symbol Y for this,
and that's gonna generate the features that we observe.
Specifically, in the formulation here, the feature counts are just drawn from a multinomial distribution. This is a distribution over counts of events whose parameter is a vector of frequencies, so the expected relative frequency of any particular event in the vector of counts is equal to the corresponding frequency in the parameter.
If we could estimate theta, that would give us
a distribution over the address terms for each edge type
and in a sense, that would tell us what the different settings of the latent variable mean.
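Just to make the emission part concrete, here is a minimal sketch of drawing one dyad's address-term counts from the multinomial, assuming a tiny hypothetical lexicon; the sizes and names are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 2   # latent relationship types (e.g., formal vs. informal); illustrative
V = 5   # size of the address-term lexicon; illustrative

# theta[k] is the distribution over address terms for relationship type k
theta = rng.dirichlet(np.ones(V), size=K)

def sample_dyad_counts(y, n_tokens):
    """Draw the count vector x for one dyad with latent relationship y,
    given that n_tokens address terms were observed on that edge."""
    return rng.multinomial(n_tokens, theta[y])

x = sample_dyad_counts(y=0, n_tokens=20)
# In expectation, x / x.sum() is approximately theta[0].
```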
Okay so what I've given you so far really hasn't
used the network in any interesting way,
it's just really a mixture model over dyads.
When I wanna bring the network into the picture,
you can ask yourself okay, are some label configurations,
some ways of arranging the latent variable on the network,
are some configurations better than others?
Why would we think that that would be the case?
Well if we look at sociology,
there's a really interesting theory called structural balance theory
and it describes networks of a specific type.
In structural balance theory,
the labels on the edges of the network indicate whether each dyad is a friend or an enemy.
Then the principle of structural balance theory
is that certain triads are stable and others are unstable,
and if a triad is unstable it means
that it's likely to, over time,
to switch to one of the stable triads.
The sort of consensus principle is that this particular triad configuration is unlikely
because these two people are enemies,
yet they have some friend that they have in common
and so either one of them is gonna convince the friend
to join their side and defeat the enemy,
or maybe the more positive outcome is that the friend
convinces the two enemies to put aside their differences.
That's the point of view from structural balance theory
and what's cool about this is that this is a theory,
as articulated here, that's just on the level of triads, but if you have structural balance, then global properties of the network emerge from it. If structural balance holds, then the entire network basically partitions into factions, and that's something that you can prove just on the basis of these triads.
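Just to make the triad-level claim concrete, here is a tiny check of classical structural balance, under which a signed triangle is stable exactly when the product of its three edge signs is positive:

```python
from itertools import product

def is_balanced(s12, s13, s23):
    """Edge signs are +1 (friends) or -1 (enemies); a triad is
    balanced iff the product of the three signs is positive."""
    return s12 * s13 * s23 > 0

for signs in product([+1, -1], repeat=3):
    print(signs, "stable" if is_balanced(*signs) else "unstable")
# All-friends is stable; two friendly edges plus one hostile edge
# (two enemies sharing a common friend) is unstable, which is exactly
# the triad discussed above.
```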
Now this is a particular type of network with a particular type of social meaning,
and that may not be the most relevant thing
when we think about the social meaning
of address terms like Mr. Lebowski and Dude.
In the case that I wanna deal with,
the magnitude and the direction of the effect
for each triad type is unknown.
Here I'm considering the case where the latent variable is binary, it can only take two possible values, and these are the four triads that can emerge after rotation. The stability of each triad is not known in advance, so it's just gonna be another parameter that we have to estimate.
You can view this as a form of structure induction,
so what we're doing is we're adding
essentially a prior piece of information that says there's gonna be some score
for the configuration of labels on the network
that's sort of prior to what we infer from the text.
That score is gonna be based on the counts of triads
that we have in the network.
I'll formalize this a little bit more in terms of
the math behind the prior distribution.
It's essentially a log linear model.
This is again, a prior distribution over
labelings of the network, that's what the bold Y means.
Again, G is the network itself.
And eta and beta are gonna be parameters.
First I have features on dyads and I have weights of those features.
A feature on a dyad might tell you,
okay these two people have 10 mutual friends,
so that's a feature of the dyad alone.
Then I have the set of triads in the graph, and I have parameters for each rotated triad type, so for each of those four triad types that I showed you before, I'll have a different parameter, and that's something that I'm gonna have to estimate.
Then I just have a normalizing constant to make sure that this thing sums to one over all possible labelings of the network.
Yeah.
- [Audience Member] Quick question.
(mumbles) all the individual nodes?
The properties of individuals are gonna be ...
- We'll put that in the likelihood, I guess.
Because we're not assuming we know anything about the nodes
in terms of additional covariates,
but there's no reason you couldn't incorporate, you know,
yeah if you had some other sort of, like the age or something,
you had some other kind of covariate information,
you could put that in too, yeah.
That's the prior, and it factors over dyads and triads of the network. I already gave you the likelihood, which just says that the distribution of linguistic features is multinomial, indexed by the relationship type. The joint probability is just the product of these two terms.
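A rough sketch of the unnormalized joint score for one labeling of the network, combining the log-linear prior with the multinomial likelihood. The data structures, and the way eta depends on the label, are simplifications of mine rather than the exact parameterization in the paper, and the normalizer Z is exactly the part this does not compute.

```python
import numpy as np

def log_joint_unnormalized(y, dyad_feats, triads, eta, beta, theta, x):
    """Unnormalized log p(y, x | G).
    y:          dict mapping dyad -> label in {0, 1}
    dyad_feats: dict mapping dyad -> feature vector (e.g., mutual friends)
    triads:     list of dyad triples that form triangles in G
    eta:        per-label weight vectors for the dyad features
    beta:       dict mapping a sorted label configuration -> triad weight
    theta:      per-label distribution over address terms
    x:          dict mapping dyad -> observed count vector of address terms"""
    score = 0.0
    for d, f in dyad_feats.items():          # dyad features in the prior
        score += eta[y[d]] @ f
    for (d1, d2, d3) in triads:              # one weight per triad configuration
        score += beta[tuple(sorted((y[d1], y[d2], y[d3])))]
    for d, counts in x.items():              # multinomial likelihood (constants dropped)
        score += counts @ np.log(theta[y[d]])
    return score
```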
In fact, this model allows us to answer a lot of questions about networks.
You could ask what is the relationship of each dyad?
And that's what you get from doing inference on Y.
You could ask how are social relationships expressed in language?
And that's what you get by looking at the parameter theta.
Then you could ask what sort of structural regularities
emerge in these types of networks?
And that's what you get from looking at the parameters of the prior, beta and eta,
and again beta is the sort of stability of each triad type
and eta is the weights on features of the dyad.
- [Audience Member] Say it again, what is X here again?
- X are the linguistic features.
X is like the counts of different forms of address.
- [Audience Member] And then Y is ...
- Y is the latent, the unknown latent label of the dyad.
(audience member mumbles)
The graph structure is observed, yeah, yeah.
We know who talks to whom. (audience member interjects)
We just know it, yeah we just know it.
It's not a random variable.
Okay and so we're gonna apply this model to a data set of 600 movie scripts.
Movie data is super interesting for this kind of stuff
because the types of relationships and the types of address terms people use are incredibly diverse.
You can imagine like a court room drama
versus like a movie about a bunch of guys on a submarine,
versus like a medieval history kind of thing,
so all kinds of different forms of address can emerge in these movies,
so kind of interesting scenarios.
The networks in these sorts of movies
range from like five people to 30 people or something,
so relatively small networks.
Before we get to results on that data,
there are just sort of two computational problems
that emerge with this model.
First of all, even if I were given the parameters of the objective, finding the optimal labeling is NP-hard. There's a proof of that fact, I think by reduction from vertex cover. To handle that issue, we make a mean field relaxation.
Essentially for each dyad, I have a variational parameter
that says what's my belief about what the label is for that dyad,
and then we iteratively update just
this sort of variational approximation
that's a product of those individual potentials.
Learning is also hard.
I've introduced this normalizing constant Z.
This requires summing over all possible labelings of the network,
there's no way to do that efficiently either.
We get approximate gradients by applying noise-contrastive estimation,
this essentially means that we sample
labelings of the network that are incorrect
and we get an approximate gradient by comparing against those.
This learning, I should say, is sort of the inner loop of an EM-style algorithm, because we don't observe label data at all. In the E step, I'm updating these Q distributions over dyad labels, and in the M step I'm updating my parameters.
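A very schematic sketch of that E step as coordinate-ascent mean field; local_expected_score is a hypothetical helper standing in for the expected contribution of every prior and likelihood term that touches one dyad, with the other dyads averaged under the current beliefs q.

```python
import numpy as np

def mean_field_e_step(q, dyads, local_expected_score, n_iters=10):
    """Coordinate-ascent mean-field over binary dyad labels.
    q[d] is the current belief that dyad d has label 1.
    local_expected_score(d, label, q) is a hypothetical callback returning
    the expected log score of the terms involving dyad d under belief q."""
    for _ in range(n_iters):
        for d in dyads:
            s0 = local_expected_score(d, 0, q)
            s1 = local_expected_score(d, 1, q)
            m = max(s0, s1)                      # subtract max for numerical stability
            q[d] = np.exp(s1 - m) / (np.exp(s0 - m) + np.exp(s1 - m))
    return q
```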
For a sort of face validity check, when I run the model forward
with two possible settings for the latent variable,
these are the two sort of clusters that emerge
and I'm showing the words that are particularly indicative of each cluster.
I think some structure kinda really stands out here on the cluster on the left,
these are sort of high formality kinds of terms.
The cluster on the right, these are informal kinds of terms.
A reviewer raised a question about this term "son" being in the formal cluster. That reviewer was not from the South of the United States.
(audience laughs lightly)
There, using that term definitely speaks to a sort of power asymmetry, I think.
This is just face validity, looking at it yourself.
We did something a little more quantitative than that.
We did an intrusion detection task,
so what this means is that we bring in raters,
just people, and we show them three terms from one list
and one term from the other list,
that's the intruder, and we ask them to figure out
which term is the intruder term.
What's interesting is that when we run the full model, they're able to find this intruder term in 73% of the cases.
The chance rate would be 25%.
When we show them the model that we learned
without the structural prior on the network,
so without this sort of complicated distribution
over possible labelings of the network,
they only find the intruder 52% of the time.
There's no additional textual information; just incorporating this sort of structural information about the network actually means that the model makes much better inferences about the text.
Yeah.
- [Audience Member] How are the participants
given the information on the right?
- We just show them, like I pick three terms
from the left list and one term from the right list,
shuffle them up, and say which is the outlier
out of these four?
And I do that several times.
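Here is a small sketch of how such a word-intrusion trial can be constructed; the two term lists are placeholders of mine, not the model's actual output.

```python
import random

formal_terms = ["sir", "mister", "captain", "doctor"]   # placeholder list
informal_terms = ["dude", "man", "bro", "buddy"]        # placeholder list

def make_intrusion_trial(cluster, other_cluster, rng):
    """Three terms from one cluster plus one intruder from the other,
    shuffled; the rater passes if they pick out the intruder (chance = 25%)."""
    items = rng.sample(cluster, 3)
    intruder = rng.choice(other_cluster)
    trial = items + [intruder]
    rng.shuffle(trial)
    return trial, intruder

rng = random.Random(0)
trial, answer = make_intrusion_trial(formal_terms, informal_terms, rng)
print(trial, "-> intruder:", answer)
```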
- But when you're switching (mumbles) the model,
you get the 72% versus the 52%?
- So the full model, I get 73%.
When I give them a crippled model that doesn't
know about the sort of structural features, it--
- What do you mean by give them a crippled model?
Do you just give them these terms from the model?
- That's right, that's right, yeah.
Sorry, output from two different models, yeah.
- [Audience Member] This is (mumbles)
- [Jacob] Sorry?
- [Audience Member] This is a real, you're training us on that movie data?
- Yeah, just trained on movie data, no supervision at all.
Then we bring in, just bring in people.
- [Audience Member] And a string model rather than this one it can ...
- The textual model is basically a clustering model,
it's basically EM clustering.
It's multinomials and there's not a lot more to it.
We can show these pictures that emerged for specific films.
This is the first and original Star Wars, A New Hope.
The dotted edges I think are from
the informal cluster of terms that emerged.
The solid blue edges are from
the formal set of terms that emerged.
I think Vader is not in this picture because he doesn't
actually talk to any of these characters in the movie,
but the way the data set is constructed,
they only show you data where it's just two characters talking to each other.
If it's like a group of characters talking,
then we don't see that in the data
because we wouldn't know what to do with that anyway.
Yeah, so what's interesting is that the triads are all sort of informal, and then the more formal edges are between these sort of outlier individuals.
This is a bigger picture, this is from the movie Ghostbusters.
The original Ghostbusters, if you don't remember,
are Venkman, that's the Bill Murray character,
Stantz is the Dan Aykroyd character, and Spengler is Harold Ramis, and they're on informal terms together again. Formal relations are with Dean Yeager, who fires the Ghostbusters from Columbia at the beginning of the movie; that's a formal relationship. Of course the bad guy is Peck, the evil EPA regulator, and that's a formal relationship as well.
Last sort of piece about this,
these are the weights on the triads that emerge.
I'm labeling the edges as t and v,
that actually is sort of a reference to the sociolinguistic theory
about forms of address.
You can think of it as like tu and vous in French, so t is the informal. I guess it comes from Latin, so tu and vos.
The triads that are preferred, that the model learns to prefer,
it prefers a triad of everyone on informal terms.
It gets a positive weight,
it likes a triad with everyone on formal terms.
It dislikes both of the heterogeneous triads, but it's sort of more tolerant of this one.
I didn't really understand why this triad was better
than this one but my graduate student explained it to me.
He said, "Look, in this triad, "like the guy in the bottom right,
"that's the professor and these two are the students.
"So the students are informal with each other
"and they're formal with the professor."
He had sort of a story that he could tell whereas
with this triad, it doesn't really make as much sense.
That's sort of the story with that paper. The address term lexicons that we built were then adopted in a later project from Stanford on racial disparities in the language that police officers use during traffic stops. It's fun to show pictures about Ghostbusters and Star Wars, but actually there are, I think, implications that come from this kind of work that are more urgent.
I should say this project is part of an ongoing effort
towards understanding social meaning in language,
so address terms are one facet of that,
but there are other devices that people have
to create and to modulate social meaning in language.
This is a project that's being carried on by my PhD student Umashanthi and a sociolinguist at the University of Pittsburgh, Scott Kiesling.
That's it for the piece on social networks,
so I'm gonna move to the next part of the talk
unless there are more questions now.
Okay so in this part of the talk,
I wanna talk about this concept of influence. In general, when people do social science,
there are these sort of constructs
that I think we have a lot of intuitions about
and we'd like to theorize about,
but it's often hard to think about relating those concepts
to things that we can objectively measure in data
and that's what I mean by operationalizing these constructs.
Influence is a good example of something that everyone sort of agrees exists,
like everyone believes that there are people
that are influential or people that are not influential,
where their influence explains how things happen.
It's not so easy to say what influence is.
One sort of account of this that I really like
is from this sociologist Pierre Bourdieu.
He talks about a particular form of influence
that he calls symbolic power and it's the ability to control how others use language,
so to determine what the boundaries of legitimate language are.
Sort of another take on what Bourdieu means by influence is to think about language change, and explanations for language change, as revealing something about who holds sociocultural influence. Who is it who's able to shape the direction of the language that other people use?
Now this is something, you know, this is something that we can observe
and it's something that sociolinguists have been
very interested in observing for a long time.
The sort of typical sociolinguistic methodology
for understanding language change is to focus on changes that take
multiple generations to really come into play,
so these are typically phonetic changes in terms of how vowels are pronounced.
There are things that you can learn from looking at change on that level,
but if we could observe language change that was more rapid,
we could maybe form a more fine-grained model of
the social processes that really underlie language change.
One thing that I think really transformed the way we could do this kind of research is social media, Twitter in particular. When we look at Twitter, there are a huge number of phenomena where we can see, in real time, a word go from never being used to being quite popular throughout Twitter.
This is one of my favorite examples, is this ctfu,
the c is for cracking and the u is for up.
I like this example a lot because I think I've seen
the first tweet where this word is ever used.
I think this tweet is from Cleveland, Ohio,
I don't know if you can see the red dots
are geo tagged tweets that contain this word.
I think the first tweet ever was Cleveland, Ohio in 2009.
By 2010, you see that this word is being used
in Southwestern Pennsylvania, in Philadelphia,
and a little bit in D.C. and New Jersey.
By 2011, it's spread to other cities like
my hometown of Atlanta, Chicago, and San Francisco.
This is an example of language change
on a much more rapid scale than the sorts of language change
that are typically studied in sociolinguistics.
You could, I think, look at a phenomenon like ctfu and say this is kind of trivial, like ephemera, like why is this important for social science?
But think about what it takes for a word like ctfu
to go from a single tweet to being used by thousands of people per day.
Like what would it take for you to use this word?
Probably you're not gonna think of it yourself,
probably you're gonna need to be exposed to it first.
Just the fact that you use it tells me
you were likely exposed to it through someone else
and that tells me something about who you're connected to socially.
You're all in this room and so now you've seen it,
so now you've all been exposed to it.
Perhaps you're all gonna start using it yourself or perhaps not.
There's also a social decision that's made here,
a social evaluation that's made.
Is this a term that I want to use?
Do I wanna be perceived as someone is perceived
if they use a term like this?
It not only tells us about who talks to whom,
it tells us about who listens to whom.
Now ctfu is one example, and if you look just at this example
you would conclude that Cleveland
is the most influential place in the United States.
(audience laughs)
As it turns out, this is not the only such example,
and they don't all start in Cleveland.
Each of these examples, I think there's a lot
of sort of contingent stuff that happens
that shapes the trajectory of these innovations
in terms of which ones succeed and when they're adopted.
What we like to do is, you know,
each one of these idiosyncratic contingent things,
we wanna sort of abstract over many of them
and figure out what the sort of underlying stuff is
that really seems to reliably predict
the trajectory of many of these innovations.
What's the aggregate picture?
We were able to find actually several thousand of these words whose frequencies change fairly dramatically over a three-year period.
We took this data and then we aggregated it
both spatially and temporally.
Spatially, we're just looking at geo tagged tweets in the United States.
We aggregated into 200,
actually the 200 largest metropolitan areas.
Temporally, we look at just week by week and we do that,
we have data for a little more than three years, so 165 weeks.
Now we have this big tensor
where we have words, we have metropolitan areas,
and we have weeks.
We wanna model this as an autoregressive process, or as a linear dynamical system.
Specifically what I'm gonna say is that
the count for each word in each week in each city is
drawn as a binomial random variable.
The first parameter of that random variable
is just the number of tweets that we have from that place at that time.
The frequency parameter has to do with some latent variable,
so it tells me the latent activation of this word in this place at that time.
What we imagine is that that latent variable evolves as a first-order autoregressive process.
All of the activations in each city at time two
depend on the activations in each city at time one
and this process just goes forward in time.
If we can recover this dynamics matrix A, the parameters of that matrix
tell us which city is talking to which other city.
This is a notion of influence, at a fairly coarse-grained level,
so it's not person to person influence,
I'll get to that in a few slides, but here it's city to city influence.
We can do inference in this kind of model; we use a Sequential Monte Carlo technique, again sort of interleaved with expectation maximization. The E step is Sequential Monte Carlo, and that gets us an approximate set of expectations, which we use to estimate the parameters.
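As a minimal forward simulation of the kind of model described here, with a latent linear dynamics over city activations and binomial emissions; the logistic link, the row-normalized A, and all of the constants are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cities, n_weeks = 200, 165

# dynamics matrix: mostly self-persistence plus a little cross-city influence
A = rng.random((n_cities, n_cities))
A = 0.9 * np.eye(n_cities) + 0.1 * A / A.sum(axis=1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# latent activation of one word in each city, a first-order autoregressive process
nu = np.zeros((n_weeks, n_cities))
nu[0] = rng.normal(-6.0, 1.0, size=n_cities)
for t in range(1, n_weeks):
    nu[t] = A @ nu[t - 1] + rng.normal(0.0, 0.1, size=n_cities)

# emissions: word counts out of the total number of tweets for each city-week
n_tweets = rng.integers(1_000, 50_000, size=(n_weeks, n_cities))
counts = rng.binomial(n_tweets, sigmoid(nu))
# The actual work goes the other way: infer A (who leads whom) from the counts.
```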
This is sort of a discretized version of that A matrix
that emerges from the model.
These are the city to city connections that we're particularly confident about.
I wanna emphasize that the model does not know anything about geography,
so the actual locations of these cities are not encoded in the model at all.
The picture that emerges, geography seems quite prominent.
We have a really dense network of connections
on the West coast, dense network connections
in the Mid-Atlantic, and only a few connections
that really span large parts of the country.
This might lead you to conclude that geography is what matters here. Oh, and these are mostly mutual connections, almost all mutual connections.
In almost every case, Seattle influences Portland,
Portland also influences Seattle.
A few of these are asymmetric and I'll come to that on the next slide.
The question, I think, you could ask here is,
okay geography plays a role.
Is it the only thing that we need to know about?
Is geography sort of the whole story?
It would make sense, right?
I mean people's social networks
are generally geographically quite locally anchored.
I think actually all of us in this room
are probably exceptions to that in the sense that
as academics many of us are maybe far from where we were born
and our social networks are maybe more spread out.
We've looked at this for typical people, and their networks are generally quite geographically compact.
Maybe geography is the whole story.
Well maybe not.
We try to look and see what is it about the city pairs
that are connected, what is it about those pairs of cities
that distinguishes them from other random pairs of cities?
Geographical distance is a strong predictor.
The further apart two cities are, the less likely
they are to share a strong coefficient of influence,
but it's not the most important thing.
The most important thing is demographics and specifically racial demographics.
The more dissimilar two cities are in terms of racial demographics,
the less likely they are to share linguistic influence.
The more similar they are,
the more likely they are to share linguistic influence.
This I think really fits a picture
if you know much about American dialectology,
the sort of primary differentiator in American English
is the difference between African American English
and White American English.
Geographical differences are much smaller on
any relevant dimension than that primary difference.
These are all symmetric effects.
There are also asymmetric effects,
so what cities tend to lead and what cities tend to follow,
and these are more common sense.
So I think larger cities tend to lead and younger cities tend to lead.
Yeah.
- (mumbles) is different if we just use the stronger model
where you just count the number of tweets back and forth?
- The number of tweets, what's the tweet ...
- Like for example,
two cities that respond to each other's tweets a lot.
Or two sub-communities within the city, right?
Would you predict the same with them or would it be different?
- You can think of other ways to construct networks of cities,
and so like who replies to whom or who retweets whom.
Might be similar, I haven't done it.
Yeah, it's interesting, it's an interesting idea.
I mean I think, yeah I think, so it's likely that
what we see here reflects communication patterns like that
so that, you know, I have every reason to believe
it would tell you something related to that, yeah.
- Okay maybe it's more of a discussion point but
this sorta gets back to I think (mumbles) question
about covariates, like if you think of it statistically, what you would like to do is correct for all kinds of effects in the population, so (mumbles) theory, you'd like to include some covariates
and see if the residual effect of these things
are still there if you include geography.
So it's not just running it separately it's like ...
- So I'm running like a multiple regression here.
- You are including ...
If you have ...
But you don't have an effect for like
how connected those two cities are, like unconditional--
- [Jacob] Yeah.
- And the number of tweets.
- I don't know that I'm trying to say that it's sort of over and above; that could be part of the explanation, right? Probably it is part of the explanation, like people know each other more, and, you know, who knows whom is part of the story for sure.
- If it's not over and above,
that seems a little strange that discuss the fact, right?
Because if you're saying people in other cities
talk to each other more,
I mean people in certain cities talk to each other more
and then you get these effects downstream.
That seems like a little less.
- I mean, right.
So what I guess ...
- Probably it's, you know, like, what are you trying to ...
(chuckles)
- In one sense, you're not gonna get
any asymmetric effects from that point of view, right?
So if I just look at which pairs of cities talk to each other,
that's a purely symmetrical sort of model.
You won't learn who's a leader and who's a follower that way.
So the question is like, you know,
if I think cities with similar racial demographics
talk to each other more, do I expect even an additional effect for race
when it comes to shaping language change?
I don't know.
Maybe it's important to show that, I'm not sure,
I'd have to think about that.
This was a picture where I was thinking about influence at a fairly coarse-grained level of city-to-city influence.
What we'd really like to think about
is person-to-person influence, that would be even,
if we could really get down to the level of like
what person influenced what other person?
You can think about this almost as like a detective story.
I see person i use this word for the first time at time t. Whose fault is it, who is responsible? Where do I put the blame? Did they just think of it themselves, or is there some particular individual that exposed them and led them to adopt it?
Now the results I just showed you on the previous slide
are based on a free sample that you can get from Twitter,
you can get roughly one percent of Twitter data from a streaming API.
A sample like that I think is fine
when you're talking about aggregate inferences about cities,
but it's not gonna work for this kind of
person-to-person influence analysis, right?
If I have a small sample of Twitter data, it's very unlikely that I see both the event at (i, t) and the event at (j, t'); it's very unlikely that both of those survive sampling.
To study this question,
I built a collaboration with researchers at Microsoft Research in New York
where I was able to get complete public records
for four million American Twitter users.
When I say complete public records,
I don't have their tweets, I just have record
of exactly when they used specific words.
I had to give them a list of words and they came back and gave me time stamps
for when they used those words and they gave me social networks.
I have two slides on this but I think
in the interest of time I'm just gonna give you one.
The first question you could ask is like does the network that I get,
the network of who follows whom on Twitter,
does that tell me, you know,
is linguistic influence spreading on that network
or is everything we're observing coming from somewhere else?
You could imagine people see these words on TV or something
and then, or they get them over email or in text messages,
is the social network even relevant?
To answer this question, we do an epidemiological style analysis.
Specifically, we sort of treat it like getting a disease. What's your chance of getting the disease spontaneously? What's your chance of getting the disease given that you've been exposed to the disease by someone on the network? We're gonna compare the two; the relative risk is the ratio of those two probabilities.
I can compute the relative risk, but there's still the possibility that there's some lurking confound: maybe I formed a friendship with somebody because we have an interest in the same internet slang, and so that would give the appearance that there's influence happening on the network when there is not. You can correct for that by computing the same relative risk statistic with a randomly rewired network, so you randomly rewire all the connections but preserve the overall degree distribution of the network.
Then you compute the relative risk in that network as well.
If the relative risk is much higher in your original network, that's a stronger indicator that there really is influence on this network.
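A schematic sketch of that comparison, using networkx's degree-preserving double_edge_swap for the rewiring; how "exposed" is determined (it depends on the timing of neighbors' first uses) is abstracted into a hypothetical callback.

```python
import networkx as nx

def relative_risk(G, exposed, adopters):
    """P(adopt | exposed by a network neighbor) / P(adopt | not exposed),
    where exposed and adopters are sets of node ids."""
    unexposed = set(G.nodes()) - exposed
    p_exp = len(adopters & exposed) / max(len(exposed), 1)
    p_unexp = len(adopters & unexposed) / max(len(unexposed), 1)
    return p_exp / max(p_unexp, 1e-12)

def rewired_relative_risk(G, adopters, compute_exposed, n_swaps=None):
    """Same statistic on a randomly rewired copy of G that preserves the
    degree distribution; compute_exposed is a hypothetical callback that
    recomputes the exposed set on the rewired graph."""
    H = G.copy()
    n_swaps = n_swaps or 10 * H.number_of_edges()
    nx.double_edge_swap(H, nswap=n_swaps, max_tries=100 * n_swaps)
    return relative_risk(H, compute_exposed(H, adopters), adopters)
```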
We looked at three different classes
of linguistic innovations and
in all three cases, we found a relative risk that was
slightly, but significantly greater than one.
What I think is interesting, and emerges from this picture, is that there's a real difference between one of these types of innovations and the other two.
What I'm calling phonetic innovations, these are re-spellings of words in ways
that somehow reflect the pronunciation of the word.
Maybe it's an intuitive spelling of the word
based on the pronunciation,
maybe it's a spelling that reflects
what you might wanna emphasize in the word,
or if you were to pronounce it out loud,
or maybe it's a spelling that reflects your own
sort of idiosyncratic pronunciation style.
In any case, the sort of risk profile as the number of exposures increase
is much sharper for these phonetic variables
than it is for the other two types of innovations.
That actually fits a picture that we have from sociology
about the adoption of behaviors that are socially risky.
When you want to adopt a behavior that you're worried
is gonna be negatively evaluated by people around you,
it takes more than one exposure for that to really be
a safe decision for you to make.
I think that's pretty reasonable in this case.
With these phonetic spellings, you could just sort of look like you're ignorant and don't know how to spell, so that would be a negative evaluation that you wanna avoid, and waiting to have two or three or more exposures before you adopt is a more prudent move in that case.
I had one more piece so I think I'm gonna skip this Hawkes process story
because I really wanna get to this causal inference question.
This gets back to some of these existential challenges
that tech companies are facing
that I mentioned at the beginning of the talk.
Hate speech is something that
many different social media platforms
have had to contend with in the last few years and
one approach that you could take to hate speech
is to try to find the forums where hate speech is present
and shut those forums down.
You can see the outcome of this going two possible ways. You could say it's not gonna work, because if you shut down the forum that has hate speech, the people that post there are just gonna take their hate speech and post it elsewhere, and actually it's kinda gonna be worse, because maybe the forum was sort of keeping it all in one place where most people didn't have to deal with it, and you shut down the forum and now it's out in everybody's face.
The other possibility is that the forum itself
somehow encourages more hate speech,
that the existence of a forum where hate speech is sanctioned
creates sort of an echo chamber effect where people use more hate speech
than they would otherwise use so by eliminating the forum,
maybe you'll reduce the amount of hate speech overall.
In 2015, Reddit closed several forums where
a lot of hate speech was present, and they closed these forums
for violation of their anti-harassment policy.
What this does it is enables a natural experiment to test
this question that I just gave you on the previous slide
about the effectiveness of this intervention.
I say it's a natural experiment,
it's not like a real experiment because they didn't do a randomized control trial
and close some of these subreddits at random.
They picked them for their own idiosyncratic reasons
so we don't really know.
We wanna do, we wanna essentially make a causal inference
from observational data here,
that's what happens in a natural experiment.
The language for talking about that kind of inference,
it comes from like medical experiments.
So there's a treatment group and a control group.
The treatment group got the experimental drug,
the control group got the placebo.
In this case, the treatment group,
these are user accounts that posted in the subreddits
that Reddit then banned,
so these were people that were posting in places
that had a lot of hate speech.
Now we wanna find a set of control individuals
that are as similar to the treatment group as possible
in all respects except for the treatment.
They're very similar, but they do not post
in these specific places that got closed by Reddit for using hate speech.
The way that we do this is we look at forums,
so we look at people that posted in the forum that got shut down.
FPH stands for fat people hate, that was one of these two forums.
These two blue circles posted in that forum.
They also posted in these three tan square forums.
We build a pool of possible control individuals,
that's people that also posted in these control forums.
Then to select specific individuals for each user in the treatment group,
we select one user from the control pool,
one user account from the control pool
who is as similar as possible on every dimension that we could measure.
And this is a matching approach
which is sort of a classic approach to causal inference.
In the end, we get a treatment group and a control group; they share the property of posting in a lot of these control forums together, and they're as similar on every dimension that I can measure as possible.
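A minimal sketch of the matching step: greedy one-to-one nearest-neighbor matching on whatever covariates you can measure. The covariates and the Euclidean distance here are illustrative choices of mine, not necessarily what the paper used.

```python
import numpy as np

def match_controls(treated_covs, control_covs):
    """Greedy one-to-one nearest-neighbor matching on covariates.
    treated_covs: (n_treated, d) array; control_covs: (n_pool, d) array,
    assumed larger than the treated group. Returns one control index
    per treated unit."""
    used = set()
    matches = []
    for t in treated_covs:
        order = np.argsort(np.linalg.norm(control_covs - t, axis=1))
        pick = next(i for i in order if i not in used)
        used.add(pick)
        matches.append(int(pick))
    return matches

treated = np.random.default_rng(0).normal(size=(5, 3))
pool = np.random.default_rng(1).normal(size=(50, 3))
print(match_controls(treated, pool))
```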
Now we have to measure the amount of hate speech that people are using.
I developed a technique back in 2011 for identifying key words.
It works pretty well for this type of data.
We're able to find key words that are unusually frequent in each forum.
We didn't wanna sort of label a bunch of posts as hate speech or not,
there are data sets like that,
I'm sort of skeptical about those annotations
for various reasons we can discuss offline.
Instead I just used the existence of the forum
as sort of a proxy to say what's sort of different
about the language in this forum
from language that is used elsewhere?
Now some of that is not hate speech.
People mention that, whatever forum people are posting in,
they mention the name of that forum more than other things
and so that's something we can remove from this word list.
There's sort of a set of words that have arisen around
just the act of posting offensive content
but those words are not really hate speech themselves.
Then I think the most difficult case is words that are frequently used in hate speech argumentation but also have uses in non-hate-speech discourse, and so we eliminate those words as well.
We take this list of 100 key words,
we do this manual analysis,
end up with about only 20 remaining.
I did this, my graduate student did it,
we measured the interrater agreement, it was quite high
so we were able to come to a consensus on what should be in this list
and what should be excluded.
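As a stand-in for the keyword step (this is not the 2011 technique mentioned above, just a simple smoothed log-odds sketch that surfaces words unusually frequent in a target forum relative to a background corpus):

```python
import numpy as np
from collections import Counter

def top_keywords(forum_tokens, background_tokens, k=100, alpha=0.5):
    """Rank words by smoothed log-odds of the forum versus the background."""
    f, b = Counter(forum_tokens), Counter(background_tokens)
    vocab = set(f) | set(b)
    n_f, n_b = sum(f.values()), sum(b.values())
    scores = {}
    for w in vocab:
        p_f = (f[w] + alpha) / (n_f + alpha * len(vocab))
        p_b = (b[w] + alpha) / (n_b + alpha * len(vocab))
        scores[w] = np.log(p_f / p_b)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Usage: top_keywords(forum_tokens, background_tokens) gives the 100
# candidate keywords that then get manually filtered, as described above.
```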
Okay and so these are the results and
I'll take a minute just to walk you through this slide.
Each of these panels tracks the amount of hate speech in one forum.
The X axis shows time and at time zero,
that's when the intervention happens, so that's when the forum shut down.
Sorry, the counts are hate speech by these individual user accounts.
The forum shuts down, there's no more speech
of any kind in that forum
but the people that posted there continued to post elsewhere.
The Y axis is the fraction of tokens that match our hate speech lexicon.
On the top we have the manually filtered lexicon
that my student and I did.
On the bottom we have the original 100 key words.
We wanted to have something that was completely automated
so that's what's on the bottom.
- [Audience Member] So this is, on the Y axis,
this is the fraction of hate speech on a different,
on a different site or just the overall--
- In Reddit, but in different forums.
Reddit has many different forums, right, so--
- Right, right, but you're including the hate speech
they're using in the forums
that were shut down on the Y axis.
- Yeah, before the cutoff yeah.
Yeah, before the cutoff, yeah.
Then after the cutoff, those forums are closed
so no one posts there any more.
What we see, you know at the time of the treatment,
a dramatic drop in the amount
of hate speech that people used.
No sort of corresponding drop in the control group
which tells us that there isn't some exogenous reason
that hate speech was decreasing,
that it really does seem to be due to the treatment.
The left column is one forum
that was closed for hate speech,
the right column is another one,
so just two, they closed two forums at this time.
We wrote this paper.
A day or two after it came out
it was posted to Reddit.
It became the most popular story on Reddit; it turns out one thing Reddit really likes is research papers about Reddit.
(audience laughs)
So this is 47,000 up votes for this story.
I've told you the piece about the language side
but another sort of piece of the story
is that many of the accounts that were used
to post to these forums that had hate speech,
many of those accounts were abandoned,
so people abandoned their accounts at much higher
than the base rate after this policy went into place.
People on Reddit had a lot of interesting questions and comments, specifically about the linguistics side of the research.
So how did you identify what qualifies for hate speech?
Did you bring your own sort of liberal bias or something
into this conversation?
And that's actually one reason
I felt it was really essential in the paper
to include both the sort of
unfiltered set of words that the algorithm gave us
as well as the filtered set of words that we produced.
I think the filtered words are more valid,
but I was just concerned that people would think
I was injecting my own bias into the filtering
and so I have a sort of unfiltered version as well.
- But how do you know that people
didn't go outside of Reddit?
Because--
- People certainly did.
People certainly did go outside of Reddit.
In that sense, asking whether it worked
you have to sort of ask yourself like what you mean by did it work?
Is your goal--
- Maybe worked for Reddit but not for society at large.
- Right, right, right.
I mean, you know ...
One thing that happened is people started
alternatives to Reddit that were very similar in structure
but were really dedicated to the principle
that there would be no control of content whatsoever
and many people that posted this kind of content went there.
Almost no one else went there.
So you can, you know, from Reddit's perspective, definitely a win.
From society's perspective, I don't know.
Yeah.
- Do you have a sense of what fraction
of these hate speech words occurred in the
just the sites that were shut down?
Because like wouldn't you just get some effect where
if you just closed all the sites down that had hate speech,
you'd just see a drop in the amount of hate speech?
Or like were there any sites that had
a lot of these words in it that weren't shut down?
- I see, yeah, okay.
Yeah, so there were a number of, yes.
- Your term, you would just expect a drop
is what I was trying to get at
because you just said--
- [Jacob] Although, I mean if--
- If everything's correlated with these sites,
I shut them down, and (mumbles)
- Right, so.
Right, so we had this sort of set of control forums
where people from hate speech groups tended to post
and if you look at what was going on in those forums,
there's a lot of hate speech in those places as well.
- Okay and maybe ...
You don't have the plot for how much hate speech
was going on in those other forums?
Like if you shut down the hate speech forums
do they sort of decrease the amount of hate speech
on other forums or does it stay the same or does it increase?
What am I missing?
- So it seems to decrease because nothing,
like once I cross zero on the X axis,
like the hate speech forums ...
The treatment forums, they've disappeared
so you can't post there anymore.
The total amount of hate speech that these people ...
Sorry?
- That's the total but it's not on those sites.
- I think what Sean wants is
the non shut down forums on the left side of that line,
how much of that--
- Because that's sort of where
it seems like the ...
- Yeah I don't know the answer to that,
yeah I'll think about that, yeah.
People were quite interested in knowing
how we picked these words.
One thing that happened shortly after the paper came out
is that Reddit decided this was a good strategy and
expanded it dramatically,
so they shut down many more such forums.
This is another causal inference question
that we won't know the answer to as to whether
our research paper had any role in that, don't know.
Sort of thinking about whether it worked and why it worked,
in terms of why it had the effect that it had,
Reddit has this kind of interesting federated structure
where individual forums have moderators
and those moderators generally have
more or less complete freedom to decide
what is acceptable in that forum.
If you like to post, like a forum that I go to on Reddit
is for bike mechanics.
Like if you like to post a lot of hate speech
and you try to post in the bike mechanic forum,
like the moderators of that forum will just kick you out.
They sort of delegated a lot of the moderation
to these individuals that control the specific forums
and then if you eliminate the forums where hate speech
is sanctioned, there are not a lot of places left
for people to go.
That's potentially one reason why this worked for Reddit
and would be maybe difficult to replicate
in a place like Twitter or Facebook.
The question was about whether people
go to alternative sites other than Reddit?
For sure some people did.
From Reddit's perspective, this might be a win.
From the perspective of people
that use Reddit frequently like myself, this might be a win,
but from like a, is the amount of hate speech
in the world overall decreasing?
Don't know.
Then I think our algorithms of course
only detect specific subsets of hate speech
and we really focus on things that we can detect lexically
in individual words.
I think it's always possible that just
what changed really was the character of hate speech
and people maybe expressed the same hateful thoughts
in different ways that were more difficult to detect.
That's always a possibility.
I'm gonna shift gears now,
I've been talking about research the whole time,
I wanna talk about a few other things.
As professors, that's not the only thing that we do,
fortunately or unfortunately.
One big part of what I do at Georgia Tech
has to do with building a community
around computational social science.
I've organized and co-organized doctoral consortia at EMNLP and at ACL on computational social science, in one case with Noah. I've organized a series of consortia in the Atlanta area.
You know Georgia Tech is essentially a tech school,
we don't have a lot of social science research
at Georgia Tech, so I've organized consortia
bringing together people from Georgia Tech
with people from Emory and Georgia State.
We did that three different years.
I organized a panel on computational sociolinguistics
at the American Association for the Advancement of Science.
On the teaching side, I created a course
on computational social science
that I've taught three times now.
The course has this structure where every week we have a different sort of high-level topic.
The students will read a social science paper every week,
so for computer scientists who have
never read social science papers before,
they'll read, discuss, present such papers every week.
Social scientists have taken the classes as well.
Every week we introduce a new computational concept.
We'll have a little lab session
at the end of the second lecture each week
where people work through a Jupyter notebook that I create
where they can sort of apply these methods.
Then the assignment basically is to take those labs
and extend them and do something creative and original
on their own for students.
There's a course on computational journalism,
I actually did not invent the course myself
although I redesigned it.
I took it over from Professor Irfan Essa
and this is kind of neat; it's essentially a data science course as well, targeted more at undergraduates and from a journalism perspective. It specifically targets a program we have at Georgia Tech in computational media, which is within the family of computing degrees that we offer but considerably more diverse than our computer science program, for example, so that's a fun group of students to bring these topics to.
Then I teach a course on natural language processing
that I basically redesigned from scratch
when I came to Georgia Tech.
I have a textbook under contract
with MIT Press on that topic.
That'll be done later this spring.
To summarize, I think, hopefully I've convinced you
that computation and social science,
if you didn't believe it already, are inextricably linked.
If I think about the progress
of computational social science,
you could maybe divide into sort of
a first wave and a second wave.
A lot of the early work in computational social science,
if you look at the paper from David Lazer et al.,
they focus on things like large scale instrumentation,
crowd-sourcing, and a lot of network analysis.
I think the next wave for computational social science
is artificial intelligence and machine learning.
I think being able to really measure what people mean
and operationalize social phenomena and make
more sensitive measurements really than what we could do
with just simple word counting.
I think that's really where
the field of computational social science is headed
and that's why I think UW would be an amazing place
to continue this trajectory.
I'll just give a little teaser.
This sort of goes both ways.
It's not just computation as a new tool for social science,
but also social science ideas coming back
and making computer science more robust.
I had a line of research with my PhD student Yi Yang on
using social network ideas to make
natural language processing software
more robust to language variation as well.
Okay so let me acknowledge my great students
and collaborators and sponsors
and I'd be happy to take questions from you, thank you.
(applause)
- [Noah] We have a few minutes for questions.
- Eventually with the broad process,
most of the computational social science
we heard today has been bringing
you know, all the data that we now have and computational tools to study social concepts and phenomena and theories that we already have in our minds,
but an equally exciting or maybe even
more exciting possibility is we might develop
whole new concepts and theories and whatnot
coming from all the observations that we can now do.
Do you have any thoughts on that
and what those might be?
- Yeah I mean that was, I think that,
so I've used unsupervised machine learning a lot in my work
and I think that sort of drives,
I think that's driven by that sort of agenda.
If I have a data set, can I just explore it
and see sort of what structure emerges
and does that drive new theorizing about
what's present in that data?
I mentioned this collaboration with a sociolinguist
at the University of Pittsburgh,
we're essentially doing glorified factor analysis
again, on Reddit posts, but what we're
able to sort of pull out from that
are different sorts of, what he calls, interpersonal stances, so ways people have of interacting with each other.
Some of that fits existing theoretical constructs
like formality and respect and politeness.
Others have sort of challenged us to rethink
the sort of inventory that we had in mind.
- For the second part of your talk
when you were discussing the shift
for the phonetic variation,
did you consider the possibility
that changing the phonetic
or the spelling of a word would,
was actually potentially changing the meaning of the word
like due to say a news item or something like a pun
or something like that?
- Yeah, yeah, yeah.
I think, yeah, I have like a whole paper
on phonetic variation in spelling
or phonetically motivated variation in spelling
and I think like absolute,
so to be clear about what we were measuring there,
the assumption is that every spelling
is a new innovation basically.
Every new spelling is an innovation
so we don't, we're not sort of equating
an alternative spelling to the original spelling
in maybe the way that you would do
in like a typical variation of sociolinguistic analysis.
We're basically treating each spelling as a new thing.
In terms of what is the social meaning of alternative spellings,
I think that's a really rich topic actually.
As you try to connote maybe an informal way
of pronouncing a word or you try to connote
a heavily accented way of pronouncing a word
according to some sort of caricature of an accent,
that's all stuff that can happen
in these phonetic spellings.
- Quick question, so what's been the
reaction of the more traditional social sciences to this approach?
I take it they have their own philosophies
and methodologies and this is, you know,
coming at it from a different direction
and so what has been that reaction like?
- I think ...
I've had the most engagement with sociolinguistics
as a field and you know,
I think the reaction has generally been positive,
so I've been able to publish three papers now in
this sort of sociolinguistics field specific journals,
Journal of Sociolinguistics, Journal of American Speech.
I've given talks on this stuff
at sociolinguistic conferences.
I think I would love to see more sociolinguists
doing this kind of work.
To me, that would really be, you know,
people can come and ask nice questions at the talk
but I'd really like to see them doing it themselves.
I think that's been a little slower
but I think it is happening.
Actually I've heard about some cool research here at UW
in linguistics that's going in that direction this morning.
I'd say political science is,
I think Noah's more plugged into that community.
They're fairly advanced actually
in the use of computational methods and
they can do things like write customized topic models
and that kind of stuff.
- [Audience Member] It seems like a lot of what you're presenting here
is really tied into what we're seeing in the news right now
with the fake news, the Russian postings,
the how does Facebook respond, corporate directions.
Is that kind of, am I correct in making that connection?
- I mean yeah, I think a lot of this stuff definitely
is relevant and of interest to corporations, especially social media corporations.
especially social media corporations.
I think, you know, their survival and their success
really depends on handling these issues well, absolutely.
Yeah I think, you know the problem that is
that I think it's difficult for them sometimes
to do this kind of research themselves
or to publish this kind of research themselves because
the results are not always things
maybe that they'll wanna share.
- [Noah] In the back.
- [Audience Member] Can you hear me?
- [Jacob] Yes.
- Back to the thing about cities adopting
different lexical items on Twitter,
how did you decide which features to look at
for how influential they were with each other?
- We basically had like a,
like you mean which linguistic features or which--
- Which features of cities?
Like how did you decide
to look at population or demographics?
Was that just kind of like we'll try and see what works?
- We got, you know, it was more like what is easily available from the U.S. census, so that was sort of the predominant thing, and then stuff that's easily available year to year and has been used in other research about what differentiates cities.
You can find all kinds of crazy statistics,
like number of plumbing fixtures per household
you know we thought was probably not super relevant
but we wanted something that proxied
for socioeconomic status so we used median income
but you know, there are like a lot of things you could do.
You could take that,
if you have ideas about how to take that further,
it's relatively easy to integrate
other covariates like that.
- [Noah] Okay it's 4:30, so we should thank the speaker.
(soft piano music) (applause)