Artificial Intelligence confirms you are an a**hole
I called it Q-LO (for the nerds out there, here's the github). Q-LO is a small artificial brain that can determine, from a written description of a situation, whether or not you are the a**hole.
Right but how does it decide? Is it good at that? …How does it dare?!
To answer these questions we need to consider what Artificial Intelligence (AI) means in practice (the word is abused by almost everyone), how it is built, what its limits are, and some ethical questions too.
How does it work?
Q-LO is a neural network of exactly 501 neurons, so not exactly a Nobel prize winner, especially considering that a fruit fly has around 300 times more neurons than that. The idea is to imitate the way a biological brain makes decisions and apply it to very specific, well-defined tasks.
Today neural networks are all the rage in AI development, from AlphaZero to self-driving cars, but the concept is far from new: it was already developed in the 1950s, but since computers back then had less computing power than a modern fridge, the idea was abandoned.
So what’s the intuition about? Here’s a sketch:
In this image I drew an 8-neuron network that tries to approximate how we decide whether to buy something online. The assumption is that our decision rests on 3 main inputs:
- feedback (star-rating)
- cost (in currency)
- delivery time (hours)
We will have 3 input neurons, each of which may pass a signal on based on its activation function (e.g. the neuron for the star-rating feedback will probably react neither positively nor negatively to 3 stars out of 5).
The signal is the input multiplied by a specific weight (representing a positive or negative influence on the "buy" decision), to which a bias is added (the neuron's predisposition to pass a signal on).
The same process is repeated in successive hidden layers of neurons, which take the signal of the previous neuron as their input (here’s where the magic happens in these models). In the image above there are two hidden layers of two neurons each.
Finally, the last neuron produces an output: the final decision on the purchase. Learning happens when the AI measures the difference between its decision and ours on the same inputs (supervised learning). If I decided to buy, the "target" label is 1; if the network outputs, say, 0.76, it takes the 0.24 difference and uses it to change its way of thinking by adjusting its weights and biases.
This process is called backpropagation, and it often takes place on a batch of predictions rather than after every single one.
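The learning loop described above can be sketched in a few lines of Python. This is a deliberately minimal toy, a single "buy" neuron instead of eight, and every number in it (inputs, weights, learning rate) is invented for illustration:

```python
import math

# Invented inputs, normalised to [0, 1]: star rating, price, delivery time.
x = [0.8, 0.3, 0.5]
target = 1.0            # "I bought it" -> label 1

w = [0.1, -0.2, 0.05]   # arbitrary starting weights
b = 0.0                 # arbitrary starting bias
lr = 0.5                # learning rate

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

for step in range(1000):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    out = sigmoid(z)          # the network's "buy" probability
    error = out - target      # e.g. 0.76 - 1 = -0.24
    # Backpropagation: push the error back through the sigmoid
    # and nudge each weight and the bias downhill.
    grad = error * out * (1 - out)
    w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
    b -= lr * grad

print(round(out, 2))  # creeps towards the target of 1
```

A real network stacks many of these neurons in layers and updates them all at once, but the core loop, predict, measure the error, adjust weights and biases, is exactly this.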
Cool story, but what about the a**holes?
Here things get a bit more complicated. Using price to approximate our purchasing decisions seems like a good intuition. But how do we calculate who's the a**hole?
A feasible approach is to use Natural Language Processing (NLP). In short, NLP allows us to represent words in a way that is meaningful to Q-LO (our model) while preserving their semantic relationships.
This is achieved by scattering a linguistic corpus across a multidimensional space, with a numeric value for each dimension pointing to a specific word, thus generating a word vector.
Using these vectors it becomes possible to perform mathematical operations on words in a sensible way, obtaining a new vector as a result. For instance: queen − woman + man ≈ king
With this approach even entire texts can be represented as the average of their word vectors!
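The arithmetic above can be demonstrated with toy vectors. These 3-dimensional vectors are pure inventions (real models use 100–300 dimensions learned from a corpus), but the mechanics are the same:

```python
import math

# Invented 3-dimensional "word vectors", for illustration only.
vec = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.1],
    "man":   [0.1, 0.9, 0.2],
    "woman": [0.1, 0.1, 0.2],
    "apple": [0.0, 0.0, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1 means same direction, 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# queen - woman + man = ?
result = [q - w + m for q, w, m in zip(vec["queen"], vec["woman"], vec["man"])]

# The nearest word not used in the operation should be "king".
closest = max((w for w in vec if w not in {"queen", "woman", "man"}),
              key=lambda w: cosine(result, vec[w]))
print(closest)  # king

# And a whole text can be represented as the average of its word vectors:
doc = ["queen", "man"]
doc_vec = [sum(vec[w][i] for w in doc) / len(doc) for i in range(3)]
```
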
“Stop beating around the bush!”. What about the a**holes?!
We're getting there. The intuition is that if we describe our situation to a predictive model (i.e. a neural network), in a way that is intelligible to it (i.e. word vectors), it could approximate human judgements on whether or not we're the a**hole.
But where can we find enough data to train Q-LO? The internet, of course! A very polite quantitative researcher has published a dataset of about 100,000 posts from the /r/amitheasshole subreddit, in which strangers ask others: "Am I the a**hole for having done X (e.g. to person Y)?". The community can respond with: "a**hole", "not an a**hole", "no a**holes here" or "everybody sucks".
Here’s an example:
AITA for eating more than 50% of the food me & my SO make?
I'm a 100kg man and she's a 50kg woman. I think I should get more food than her at dinner but she thinks that's unfair. I argue that because I'm a man I need more calories, but she just thinks fair is 50/50.
What do you think?
The creator of the dataset herself tried the classification between a**holes and non-a**holes, and her model scored 62% accuracy overall.
That's not bad! Especially considering that her model is a logistic regression classifier (not very powerful on non-linear problems) based on word-frequency counts (which cannot encode much semantically meaningful information).
At this point there's nothing left to do but build Q-LO and try to beat her results:
- Data pre-processing:
After obtaining the posts and their final judgement (simplified into "a**hole" and "not an a**hole"), they need to be cleaned and prepared for conversion into vectors (removing stopwords like a / an / the, odd punctuation, etc.). Then, since the dataset contains more non-a**holes than a**holes, the two classes need to be balanced artificially.
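This step can be sketched as follows. The stopword list, the example posts and the choice to downsample the majority class are all illustrative; oversampling the minority class is an equally common way to balance:

```python
import random
import re

random.seed(0)

STOPWORDS = {"a", "an", "the", "for", "of", "to", "and", "my"}  # tiny demo list

def clean(text):
    """Lower-case, strip odd punctuation, drop stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]

# Hypothetical labelled posts: 1 = "a**hole", 0 = "not an a**hole".
posts = [("AITA for eating most of the food?", 1),
         ("AITA for refusing to lend my car?", 0),
         ("AITA for skipping the party?", 0),
         ("AITA for keeping the change?", 0)]

# Balance the classes by downsampling the majority class.
ones = [p for p in posts if p[1] == 1]
zeros = [p for p in posts if p[1] == 0]
n = min(len(ones), len(zeros))
balanced = random.sample(ones, n) + random.sample(zeros, n)
random.shuffle(balanced)
```
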
- NLP:
Now it's time to convert the posts into vectors. Instead of using a ready-made model, for instance one trained on Wikipedia articles, I trained a new one with Doc2Vec, since the language in these posts is very specific. I chose 200 dimensions for the vectors, as common practice puts the dimensionality between 100 and 300. Finally, we need to divide the dataset into a training set and a test set to evaluate the model, a bit like a final exam.
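The split can be sketched like this. The 200-dimensional vectors below are random stand-ins for the real Doc2Vec output, and the 80/20 ratio is just one common choice:

```python
import random

random.seed(42)

# Stand-ins for (document_vector, label) pairs; a real pipeline would get
# the 200-dimensional vectors from the trained Doc2Vec model.
data = [([random.random() for _ in range(200)], random.randint(0, 1))
        for _ in range(1000)]

random.shuffle(data)                   # shuffle before splitting
split = int(0.8 * len(data))           # 80% training, 20% test
train, test = data[:split], data[split:]
print(len(train), len(test))  # 800 200
```
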
- Neural Network:
Now we need to decide what shape to give Q-LO's "brain". So I left my computer running for around a day (on a GTX 1070 8GB GPU) while it tried different combinations of parameters: number of neurons, number of training epochs, batch size for error backpropagation, etc.
For each of these combinations, Q-LO made 5 attempts to learn from different subsets of posts (i.e. 5-fold cross-validation), measuring the standard deviation of its own accuracy. This step, together with selectively switching off neurons (i.e. dropout), helps prevent the model from learning the right answers "by heart" (i.e. overfitting).
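The 5-fold idea can be sketched generically. This is not Q-LO's actual code; the fold logic and the per-fold accuracies below are illustrative:

```python
import statistics

def k_fold(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k-fold cross-validation."""
    fold = n_samples // k
    for i in range(k):
        stop = (i + 1) * fold if i < k - 1 else n_samples
        val = list(range(i * fold, stop))      # this fold is held out
        held_out = set(val)
        train = [j for j in range(n_samples) if j not in held_out]
        yield train, val

folds = list(k_fold(10, k=5))
print(len(folds), folds[0][1])  # 5 folds; the first validation fold is [0, 1]

# Each fold yields an accuracy; we then look at their mean and spread.
accs = [0.703, 0.706, 0.702, 0.704, 0.705]   # hypothetical per-fold accuracies
print(round(statistics.mean(accs), 3), round(statistics.stdev(accs), 4))
```

A low standard deviation across folds tells us the model's accuracy does not depend on which slice of the data it happened to see.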
For the curious among you, the best parameters found were: 100 epochs, 150 hidden-layer neurons, a batch size of 64 and the Adam optimiser. This was Q-LO's first success. Average accuracy: 70.4%, with a standard deviation of 0.2%!
- Model validation:
The moment of truth has come. We need to test Q-LO on data it has never seen before (the test set). If its accuracy were much lower than on training, it would mean it cannot generalise and has learnt to recognise the a**holes in the training set "by heart" (i.e. overfitting).
The accuracy on the test set was 71.44%, which is higher than on training and definitely higher than our 62% benchmark! Q-LO made it!
Are we trusting this thing then?
Here are its school marks:
What do these numbers mean? For each of the two categories under consideration, a**hole and not an a**hole, two metrics in particular tell us how good Q-LO is: precision and recall.
Take the a**hole category as an example. Q-LO's precision is the number of "true" a**holes it classified correctly, divided by that same number plus the non-a**holes it wrongly flagged as a**holes.
Recall again starts from the number of "true" a**holes classified correctly, but this time divides it by that same number plus the true a**holes Q-LO failed to spot.
This lets us say that Q-LO sits at around 71% all round, but it is perhaps a tad too kind to some "true" a**holes; it is also unfair to some non-a**holes, though to a lesser degree.
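In code, with invented confusion-matrix counts roughly in line with the scores above:

```python
# Invented counts for the "a**hole" class, for illustration only.
tp = 70   # true a**holes correctly flagged
fp = 28   # non-a**holes wrongly flagged as a**holes
fn = 30   # true a**holes the model failed to spot

precision = tp / (tp + fp)   # how trustworthy an "a**hole" verdict is
recall = tp / (tp + fn)      # how many real a**holes get caught
print(round(precision, 3), round(recall, 3))  # 0.714 0.7
```

A model can trade one metric for the other: flag everyone and recall hits 100% while precision collapses, flag almost no one and the reverse happens.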
Let's see, as an example, what it thinks of the man who wants to eat more than his significant other in the example above:
Let’s try now with a personal example:
AITA for having studied political science and international relations and then deciding to spend my time writing articles on artificial intelligence for fun? Tangentially I should add that I plan to release my latest one, on AI that spots a**holes, in English as well!
Alright then, now I feel authorised to share with you this story!
So what's the conclusion? Should we embrace Q-LO's decisions based on the numbers?
First of all, I hope this article helps demystify the usual banal discourse around AI, which more often than not comes from people repeating words they are convinced are "fashionable".
Q-LO is not an anthropomorphic robot with general intelligence. The vast majority of AI research and production is aimed at building stuff like Q-LO.
Secondly, I hope this article helps point out some of the limits of AI and offers some points for discussion. For instance, Q-LO:
- Formulates judgements based on a very specific use of language, and on concepts that can be very culturally specific (e.g. attitudes towards living with one's parents). It is specialised, not general.
- Deliberately approximates people's unconscious biases with its own weights and biases. This also depends on how the data was selected: an Italian point of view on what being an a**hole means might not be represented. This could propagate injustice.
- We can verify which decision it will take for given inputs, but we cannot know why it takes that decision. It lacks interpretability.
- Has no standard way of measuring its performance (everyone follows their own best practices).
- Even granting that there is an "appropriate" way to present the results numerically, the numbers would not make decision outcomes objective, or necessarily desirable.
I hope you found this article interesting. See you next time, don’t be a**holes or, if you really must, make sure not to be spotted by Q-LO!