Game Of Math: The MaxEnt Algorithm and Game of Thrones.

 

Game of Thrones.

Surely you're wondering: how does Mathematics fit into Game of Thrones? Well, I confess, I could have used any other TV series to tell you what follows in this post, but honestly... what is better than Game of Thrones (GoT)?
The connection to Mathematics lies not so much in the series itself as in the type of data used. In particular, the data in question are the comments about GoT posted in the reviews on the
IMDb site, which were used to produce this infographic, on which I worked with my brother.


 

Infographic.

Firstly, very briefly, let us see what types of information were extracted from the various comments (671 in total).

The map on the left shows the major lineages; each one has been placed at its city of reference, indicating:

·         The number of reviews in which the lineage is mentioned;

·         The number of reviews that speak positively of the family;

·         The number of reviews that speak negatively of the family;

·         The most cited member of the lineage;

·         The family members who are never mentioned.

In the chart on the right-hand side, instead, appear:

·         the trend of the mentions of the characters of each house across the five seasons of the series;

·         the most cited characters that do not belong to any lineage;

·         the graphs of co-occurrences between characters and between lineages.

The part of the infographic on which I would like to focus, however, is the word cloud in the bottom-right corner of the picture, which reports the adjectives most frequently used in the reviews about each lineage.
We need to explain how to determine whether a word in a text should be regarded as an adjective rather than, say, as a preposition or an article. And it is here that Mathematics comes to help.

To perform this type of analysis a POS (Part of Speech) tagger has been used, namely a tool able to perform the grammatical analysis of a text. The POS tagger in question is based on the OpenNLP library, which in turn relies on the Maximum Entropy model that we will analyze in detail.
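Just to give an idea of what such a tagger produces, here is a minimal Python sketch; it uses NLTK's default tagger as a stand-in for the OpenNLP-based one actually used for the infographic, and the review sentence is invented.

```python
# Minimal sketch of POS tagging, using NLTK as a stand-in for the
# OpenNLP-based tagger used for the infographic. The sentence is invented.
import nltk

nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # NLTK's default tagger

review = "Tyrion is a brilliant and witty character."
tagged = nltk.pos_tag(nltk.word_tokenize(review))  # [(word, POS tag), ...]

# Keep only the adjectives (Penn Treebank tags JJ, JJR, JJS).
adjectives = [word for word, tag in tagged if tag.startswith("JJ")]
print(tagged)
print(adjectives)  # expected: ['brilliant', 'witty']
```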

Information Theory and Entropy.

Before examining the MaxEnt algorithm, let me define the concept of entropy as it is used here.
Most commonly we talk about entropy in the following areas:

·         thermodynamics: the entropy of a gas is a function of its temperature, volume, mass and nature; it is an index of reversibility: if it does not change, the process is reversible, otherwise it is irreversible (just think of the exchange of heat between two bodies, one hot and one cold);

·         statistical mechanics: entropy is a measure of the increase of disorder (the inability to predict the position and velocity of the particles); the greater the knowledge, the lower the entropy, and vice versa.

To these interpretations of entropy a third one can be added, which plays a very important role in information theory (especially in the field of signal processing) and is the one we will use in our case.

Let's consider a source of messages $X$. The amount of information carried by a message increases with the uncertainty of the message produced. The greater our knowledge about the message produced by the source, the lower the uncertainty, the entropy and the transmitted information.
We now formalize the concept of entropy, as introduced by Claude Shannon in 1948.

Definition 1

Let $X$ be a source and $x$ a signal emitted by $X$. The information carried by $x$ is called self-information and is defined as

$$I(x) = -\log_2 p(x)$$

where $p(x)$ is the probability that the event $x$ happens.

Definition 2

The entropy of a source $X$ is the expected value of the self-information, i.e. the average information contained in any signal from $X$. In particular,

$$H(X) = E[I(X)] = -\sum_{i=1}^{N} p(x_i) \log_2 p(x_i)$$

if $X$ is a discrete variable, and

$$H(X) = -\int p(x) \log_2 p(x)\, dx$$

if $X$ is a continuous variable.

Proposition 1

Let $X$ be a discrete source with $N$ possible signals; then

$$0 \le H(X) \le \log_2 N$$

In particular, the maximum of the entropy is $\log_2 N$, attained when all the $N$ events are equally probable.
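As a quick numerical check of Proposition 1, here is a small sketch (the function name is mine) that computes the entropy of a discrete distribution and shows that the uniform one attains $\log_2 N$.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# For N = 4 equally probable events the entropy attains its maximum log2(4) = 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
print(entropy([0.70, 0.10, 0.10, 0.10]))  # ~1.36, strictly below log2(4)
print(math.log2(4))                       # 2.0
```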

[Image: Claude Shannon]

Let's see a simple example.

Example 1
Suppose we have a source $X$ that emits $1$ with probability $p$ and $0$ with probability $1-p$.

Then

$$H(X) = -\big(p \log_2 p + (1-p) \log_2 (1-p)\big)$$

and the entropy is equal to $1$ if $p = 1/2$ (an unpredictable source), while it is equal to $0$ if $p = 1$ or $p = 0$ (a predictable source, i.e. one that always outputs $1$ or always $0$).
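A tiny sketch of this example (the helper name is mine):

```python
import math

def binary_entropy(p):
    """Entropy of a source emitting 1 with probability p and 0 with probability 1 - p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(binary_entropy(0.5))  # 1.0  -> maximally unpredictable source
print(binary_entropy(1.0))  # 0.0  -> perfectly predictable source
print(binary_entropy(0.9))  # ~0.47 -> mostly predictable source
```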

The MaxEnt Algorithm.

Introduction.

The Maximum Entropy classifier is a discriminative classifier widely used in the areas of Natural Language Processing, Speech Recognition and Information Retrieval. In particular, it is used to solve text classification problems such as language detection, topic detection and sentiment analysis.
The Maximum Entropy algorithm is based on the principle of Maximum Entropy and selects, among all the models consistent with the training set, the one with the maximum entropy (in Shannon's sense).

Recalling Bayes' theorem (see here), the Maximum Entropy classifier is used when you do not have any information about the prior distribution of the data and it is not correct to make assumptions about it.
Unlike the Naive Bayes classifier, Maximum Entropy does not assume that the variables are independent of one another, which reflects the nature of natural text, where the variables in play are the words, which of course are not independent of one another because of the grammatical rules of the language; moreover, it requires more training time while providing more reliable results.

Example 2

Before going into the theory behind MaxEnt, consider an example that clarifies from the outset what will be said formally in what follows.

Suppose we want to determine the grammatical form of the word "set."

The word "set" can take the following forms:

·         Adjective: "He has a set smile."

·         Noun: "Give me your chess set."

·         Verb: "They set a fast pace."

We collect a large number of examples from which to extract the information needed to determine the decision model $p$. The model we are going to build will assign to the word "set" a probability of taking each particular grammatical role.

As we have no other information from the data, all we can impose on our model $p$ is:

$$p(\text{adjective}) + p(\text{noun}) + p(\text{verb}) = 1$$

There are several models $p$ that satisfy the previous identity, including:

·         $p(\text{adjective}) = 1,\ p(\text{noun}) = p(\text{verb}) = 0$

·         $p(\text{noun}) = 1,\ p(\text{adjective}) = p(\text{verb}) = 0$

·         $p(\text{verb}) = 1,\ p(\text{adjective}) = p(\text{noun}) = 0$

·         $p(\text{adjective}) = p(\text{noun}) = p(\text{verb}) = 1/3$

By analyzing the data set further, suppose we obtain additional information, for instance that every time the word "set" is preceded by a pronoun it is a verb. This, added to the normalization condition, further constrains the probabilities, reducing the set of admissible models.

The goal of the MaxEnt algorithm is to determine the model $p$ that is as uniform as possible (i.e. that maximizes the entropy) while respecting the information derived from the data, without making any additional assumptions.
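To make this concrete, the purely illustrative sketch below compares the entropies of the four candidate models listed above when only the normalization constraint is known: the uniform model wins.

```python
import math

def entropy(model):
    """Shannon entropy (in bits) of a distribution given as a dict of probabilities."""
    return -sum(p * math.log2(p) for p in model.values() if p > 0)

# The four candidate models listed above; each satisfies
# p(adjective) + p(noun) + p(verb) = 1.
candidates = [
    {"adjective": 1.0, "noun": 0.0, "verb": 0.0},
    {"noun": 1.0, "adjective": 0.0, "verb": 0.0},
    {"verb": 1.0, "adjective": 0.0, "noun": 0.0},
    {"adjective": 1 / 3, "noun": 1 / 3, "verb": 1 / 3},
]

for model in candidates:
    print(model, "-> entropy:", round(entropy(model), 3))
# The first three models have entropy 0; the uniform one has the maximum,
# log2(3) ~ 1.585, so with only the normalization constraint MaxEnt picks it.
```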

 

The algorithm.

We now pass to the explanation of the algorithm.

Consider a text document $d$ containing $m$ words $\{w_1, w_2, \ldots, w_m\}$, to each of which corresponds a particular tag $\{y_1, y_2, \ldots, y_m\}$ (i.e. a grammatical part of speech: noun, adjective, pronoun, article, etc.). We introduce the concept of "history" of the word $w_i$, namely the information arising from the context in which $w_i$ is located, and we denote it by $x_i$.

Let us look at a small example to explain what the "history" of a word is.

Example 3

Consider the sentence "Today is a beautiful day". The set of words is {Today, is, a, beautiful, day} and we call "history" of a word the grammatical information about the previous and next word.
For example, for the word "beautiful"

$x_{\text{"beautiful"}}$ = {indefinite article - "a", singular noun - "day"}

 

What we have to define is a stochastic model that estimates the conditional probability of getting a tag $y$ given a particular "history" $x$, namely $p(y|x)$.

We then follow the usual classification scheme, i.e. we build our model starting from the pairs $(x_i, y_i)$ of the training set, where $x_i$ is the "history" of the word $w_i$ and $y_i$ is the class assigned to it (its part of speech).
Consider the empirical probability distribution based on the sample

$$\tilde{p}(x,y) = \frac{1}{N}\,|(x,y)|$$

where $N$ is the size of the training set, while $|(x,y)|$ is the number of occurrences of the pair $(x,y)$ in the training set.
We introduce the indicator function

$$f_j(x,y) := \begin{cases} 1 & \text{if } y = y_i \text{ and } x \text{ contains } w_k \\ 0 & \text{otherwise} \end{cases}$$

and consider the features $f_j$ as the variables for the construction of our model.
The expected value of the feature $f_j$ with respect to the empirical distribution is

$$\tilde{p}(f_j) = \sum_{x,y} \tilde{p}(x,y)\, f_j(x,y)$$

where clearly $\tilde{p}(x,y) = \frac{1}{N}$ if each pair of the training set occurs exactly once.

The expected value of the feature $f_j$ with respect to the model $p(y|x)$ is instead

$$p(f_j) = \sum_{x,y} \tilde{p}(x)\, p(y|x)\, f_j(x,y)$$

We impose the condition that the expected value of $f_j$ under the model matches the one observed on the training set, i.e.

$$\tilde{p}(f_j) = \sum_{x,y} \tilde{p}(x,y)\, f_j(x,y) = \sum_{x,y} \tilde{p}(x)\, p(y|x)\, f_j(x,y) = p(f_j)$$
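As a concrete illustration of the empirical expectation $\tilde{p}(f_j)$, here is a toy sketch (data and feature are invented) for a single feature that fires when the tag is "verb" and the previous word is a pronoun, as in Example 2.

```python
# Toy training set of (history, tag) pairs; the single feature is an indicator
# like f_j above: it fires when the tag is "verb" and the previous word is a
# pronoun. All names and data here are invented for illustration.
training_set = [
    ({"prev_is_pronoun": True},  "verb"),
    ({"prev_is_pronoun": False}, "noun"),
    ({"prev_is_pronoun": False}, "adjective"),
    ({"prev_is_pronoun": True},  "verb"),
]

def f(x, y):
    return 1 if y == "verb" and x["prev_is_pronoun"] else 0

N = len(training_set)
# Empirical expectation p~(f) = sum_{x,y} p~(x,y) f(x,y), with p~(x,y) = 1/N
# since here every pair occurs exactly once.
p_tilde_f = sum(f(x, y) for x, y in training_set) / N
print(p_tilde_f)  # 0.5: half of the samples activate the feature
```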

We now have one constraint of this kind for each $f_j$, and several models can satisfy them all. To choose among them we use the principle of Maximum Entropy, selecting the model that is as close as possible to the uniform one, i.e. the one that maximizes the entropy.

In particular, it can be shown that there exists a unique model $p^*$ that maximizes the entropy within the set $\mathcal{C}$ of models satisfying the constraints.

In our case the problem is the following:

Determine $p^*$ such that

$$p^* = \arg\max_{p \in \mathcal{C}} H(p) = \arg\max_{p \in \mathcal{C}} \left( -\sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x) \right)$$

with

·         $p(y|x) \ge 0$ for all $x, y$;

·         $\sum_y p(y|x) = 1$ for all $x$;

·         $\sum_{x,y} \tilde{p}(x,y)\, f_j(x,y) = \sum_{x,y} \tilde{p}(x)\, p(y|x)\, f_j(x,y)$ for all $j$.

We use Lagrange multipliers to solve it:

$$\Xi(p, \Lambda, \gamma) = -\sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x) + \sum_j \lambda_j \left( \sum_{x,y} \tilde{p}(x,y)\, f_j(x,y) - \sum_{x,y} \tilde{p}(x)\, p(y|x)\, f_j(x,y) \right) + \gamma \left( \sum_y p(y|x) - 1 \right)$$

obtaining the solution

$$p^*(y|x) = \gamma^* \exp\left( \sum_j \lambda_j f_j(x,y) \right) = \frac{\exp\left( \sum_j \lambda_j f_j(x,y) \right)}{\sum_{y'} \exp\left( \sum_j \lambda_j f_j(x,y') \right)}$$

At this point we substitute the values of $p^*$ and $\gamma^*$ into the Lagrangian and obtain $\Lambda^*$ by maximizing the resulting function, about which it can be stated (without proving it here) that:

·         it is the log-likelihood of the exponential model $p(y|x)$;

·         the maximization problem cannot be solved analytically but only by numerical methods (conjugate gradient, GIS - Generalized Iterative Scaling), taking advantage of the regularity and convexity of the function in each $\lambda_j$; a toy numerical sketch is given below.
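Purely as an illustration (and not the OpenNLP implementation), the following sketch fits the exponential model above on the toy data from the previous snippet by plain gradient ascent on the log-likelihood; the gradient with respect to $\lambda_j$ is exactly $\tilde{p}(f_j) - p(f_j)$.

```python
import math

TAGS = ["adjective", "noun", "verb"]

# Toy training set of (history, tag) pairs, as in the previous sketch.
training_set = [
    ({"prev_is_pronoun": True},  "verb"),
    ({"prev_is_pronoun": False}, "noun"),
    ({"prev_is_pronoun": False}, "adjective"),
    ({"prev_is_pronoun": True},  "verb"),
]

# Two invented binary features f_j(x, y).
features = [
    lambda x, y: 1.0 if y == "verb" and x["prev_is_pronoun"] else 0.0,
    lambda x, y: 1.0 if y == "noun" and not x["prev_is_pronoun"] else 0.0,
]
lambdas = [0.0] * len(features)

def p_y_given_x(x, lambdas):
    """Exponential model p(y|x) = exp(sum_j lambda_j f_j(x, y)) / Z(x)."""
    scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lambdas, features)))
              for y in TAGS}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

N = len(training_set)
for step in range(500):  # gradient ascent on the log-likelihood
    grad = [0.0] * len(lambdas)
    for x, y_true in training_set:
        model = p_y_given_x(x, lambdas)
        for j, f in enumerate(features):
            # gradient component: empirical expectation minus model expectation
            grad[j] += (f(x, y_true) - sum(model[y] * f(x, y) for y in TAGS)) / N
    lambdas = [l + 0.5 * g for l, g in zip(lambdas, grad)]

print({y: round(p, 3) for y, p in p_y_given_x({"prev_is_pronoun": True}, lambdas).items()})
# After training, "verb" takes almost all the probability mass when the
# previous word is a pronoun, as the constraint extracted from the data demands.
```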

So, once the POS tagger has been trained on a given set (the training set), we proceed to classify new words (the test set). In our case the POS tagger, already trained on a sufficiently large data set, is used to classify the words appearing in the IMDb reviews; after discarding all the words that are not classified as adjectives, we assign each adjective to the families according to its frequency in the reviews mentioning a specific lineage.
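As a final, purely illustrative sketch of this last step (the real pipeline and data are not reproduced here), the snippet below counts which adjectives co-occur with each house in a couple of invented reviews, reusing the NLTK stand-in tagger from the first sketch.

```python
from collections import Counter, defaultdict
import nltk  # reuses the NLTK models downloaded in the first sketch

HOUSES = {"stark", "lannister", "targaryen", "baratheon", "greyjoy"}

def tag_review(text):
    """Stand-in for the OpenNLP-based tagger actually used for the infographic."""
    return nltk.pos_tag(nltk.word_tokenize(text))

def adjectives_per_house(reviews):
    """For each house, count the adjectives appearing in reviews that mention it."""
    counts = defaultdict(Counter)
    for review in reviews:
        tagged = tag_review(review)
        words = {word.lower() for word, _ in tagged}
        adjectives = [word.lower() for word, tag in tagged if tag.startswith("JJ")]
        for house in HOUSES & words:          # houses mentioned in this review
            counts[house].update(adjectives)  # credit the review's adjectives to them
    return counts

# Two invented reviews, just to show the mechanism.
reviews = ["The Stark storyline is bleak but brilliant.",
           "Lannister schemes keep the show unpredictable."]
for house, adjectives in adjectives_per_house(reviews).items():
    print(house, adjectives.most_common(3))
```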