
Multimodal approach


This paper presents a multi-modal hoax detection system composed of text, source, and image analysis. As hoaxes can be very diverse, we analyze several modalities to better detect them. This system is applied in the context of the Verifying Multimedia Use task of MediaEval 2016. Experiments show the performance of each modality taken separately as well as their combination.



When studying social networks (SN), one interesting aspect is the propagation of publications, e.g. news, facts, or any information considered important and shared across communities. A major characteristic of this propagation is its speed. However, users rarely verify the veracity of the information they share. Moreover, even information that has been verified as false is often shared, and its spread cannot be contained [yang2010modeling, situngkir2011spread].

Therefore, we are studying how to directly verify the veracity of any information. Our goal is to create systems that can warn users before they share false information. Consequently, we are extremely interested in the Verifying Multimedia Use task of MediaEval 2016, which aims at classifying Twitter publications to detect fake information [boididou2015verifying]. Given the nature of tweet data, diverse information can be extracted from the message and its meta-data. In this work, we explored the predictive power of various features. We propose different approaches based on text information, source credibility, and image content.

We propose four approaches: text-based (run-T), source-based (run-S), image-based (run-I), and the combination of the three approaches (run-C). For all of these methods, the prediction is first made at the image level, then propagated to the tweets that contain the image, according to the following rule: the tweet is predicted as real if all the associated images are classified as real; if at least one of the images is classified as fake, the tweet is considered fake.
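The propagation rule above can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and the handling of tweets whose images are neither all real nor contain a fake one (returned here as "unknown") is an assumption, since the rule as stated covers only the two other cases.

```python
def propagate_to_tweet(image_labels):
    """Derive a tweet label from the labels predicted for its images.

    One "fake" image makes the tweet "fake"; the tweet is "real" only
    if every image is "real". Any other situation (no image, or some
    images labeled "unknown") falls back to "unknown" (assumption).
    """
    if any(label == "fake" for label in image_labels):
        return "fake"
    if image_labels and all(label == "real" for label in image_labels):
        return "real"
    return "unknown"
```

For instance, a tweet with one real and one fake image is labeled fake, while a tweet with two real images is labeled real.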

Text-based nearest neighbors prediction

This approach exploits the textual contents of the tweets and does not rely on any external data apart from the training set. As previously explained, a tweet is classified based on the images it contains; an image is described by the concatenated texts of every tweet containing this image. The idea here is to capture similar comments between an unknown image and an image from the training set (such as "It’s photoshopped") or similar genres of comments (presence of smileys, slang or journalistic language…).

Let us note Iq such a description for an unknown image, and { I(di) } the training set of image descriptions. The class of Iq is decided based on the classes of the k most similar image descriptions in { I(di) }. In practice, to compute the similarities, we use a state-of-the-art information retrieval approach called Okapi-BM25 [RWB98]. A language-detection system (based on the Google Translate service) is used to detect non-English tweets, which are then translated into English with Google Translate. As a further preprocessing step, we use orthographic and smiley normalization tools developed in-house. The parameter k was set to 1 by cross-validation.
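As an illustration of the nearest-neighbor step with k = 1, here is a minimal sketch. It is not the system used for run-T: the whitespace tokenizer, the BM25 parameters k1 = 1.2 and b = 0.75, the particular idf variant, and the function names are all assumptions, and the translation and normalization steps are left out.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score every training description against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of descriptions containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            # Smoothed idf variant that stays positive (assumption).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def predict_nearest(query_text, train_texts, train_labels):
    """1-nearest-neighbor prediction: label of the best-scoring description."""
    query = query_text.lower().split()
    docs = [text.lower().split() for text in train_texts]
    scores = bm25_scores(query, docs)
    best = max(range(len(docs)), key=scores.__getitem__)
    return train_labels[best], scores[best]


# Toy training descriptions, invented for the example.
train_texts = ["wow it's photoshopped lol", "photo by reuters of the flood"]
train_labels = ["fake", "real"]
label, score = predict_nearest("this is clearly photoshopped", train_texts, train_labels)
```

The returned score is the kind of value that can later be passed to a fusion model alongside the predicted label.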

Trusted sources prediction

This approach, already used by [middleton2015extracting], is conceptually the simplest but relies on external (static) knowledge. As for the previous run, prediction is made at the image level, and an image is represented as the concatenation of every tweet (translated into English if needed) in which it appears. The prediction is made by detecting trustworthy sources in the image description. Two types of sources are searched: 1) a known news-related organization; 2) an explicit citation of the source of the image. For the first type, we gathered lists of press agencies in the world, newspapers (mostly French and English ones), and news TV networks (French and English ones). For the second type, we manually defined some patterns, like photographed by + Name, captured by + Name, etc. Finally, an image is classified as fake by default, unless a trustworthy source is found in its text description.
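The decision logic of this run can be sketched as follows. The organization names and the two regular expressions below are small hypothetical stand-ins for the lists and patterns described above, not the ones actually used; only the fake-by-default rule comes from the text.

```python
import re

# Hypothetical, tiny stand-ins for the gathered lists of news organizations.
TRUSTED_ORGANIZATIONS = {"reuters", "afp", "associated press", "bbc"}

# Hypothetical citation patterns of the form "photographed by + Name".
CITATION_PATTERNS = [
    re.compile(r"\b(?i:photographed by)\s+[A-Z][\w.-]*"),
    re.compile(r"\b(?i:captured by)\s+[A-Z][\w.-]*"),
]


def has_trusted_source(description):
    """Return True if the description names a trusted organization or cites an author."""
    lowered = description.lower()
    for org in TRUSTED_ORGANIZATIONS:
        if re.search(r"\b" + re.escape(org) + r"\b", lowered):
            return True
    return any(pattern.search(description) for pattern in CITATION_PATTERNS)


def predict_from_sources(description):
    """An image is fake by default, unless a trustworthy source is found."""
    return "real" if has_trusted_source(description) else "fake"
```

Requiring a capitalized token after "photographed by" is a crude way to approximate "+ Name"; a real system would likely need a more careful name detector.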

Image retrieval prediction

In this approach only the image content is used to provide a prediction, at the image level. Note that some tweets do not contain images but videos; such tweets are thus labeled as unknown.

Images from the Verifying Multimedia Use task are classified using external information. We perform image retrieval, which consists in querying a database of known fake and real images to find out whether the query is an already known fake image. The database is built by collecting images from 5 specialized websites. The set contains around 500 original images and 7500 fake samples.

Generic image descriptors are computed using very deep Convolutional Neural Networks (CNN) [simonyan2014very]. First, we apply the convolutional layers [tolias16] of the network to images scaled to a standard size of 544×544. Then, the first two fully connected layers are kernelized and applied to the output feature map, producing a new 11×11×4096-dimensional feature map. Finally, average pooling followed by l2-normalization is performed, giving a 4096-dimensional descriptor [cimpoi16,sicre15,sicre15B]. Once all image descriptors are obtained, cosine similarity is computed between the query and all images from the database. If the highest similarity is higher than a threshold of 0.9 (set on the training dataset), then the query receives the label of the most similar image. Otherwise, the query is labeled as unknown.
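The last two steps (pooling with normalization, then thresholded retrieval) can be sketched without the CNN itself. In this illustration the feature map is assumed to be already computed and given as a flat list of local feature vectors (121 vectors of 4096 values in the setting above, 2-dimensional toy vectors below); the function names are hypothetical.

```python
import math


def pool_and_normalize(feature_map):
    """Average-pool a list of local feature vectors, then l2-normalize."""
    n = len(feature_map)
    dim = len(feature_map[0])
    pooled = [sum(vec[i] for vec in feature_map) / n for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled] if norm > 0 else pooled


def cosine(u, v):
    # Descriptors are unit-length, so the dot product is the cosine similarity.
    return sum(a * b for a, b in zip(u, v))


def retrieve_label(query_desc, database, threshold=0.9):
    """Return the label of the most similar database image, or "unknown".

    database is a list of (descriptor, label) pairs.
    """
    best_sim, best_label = max(
        ((cosine(query_desc, desc), label) for desc, label in database),
        key=lambda pair: pair[0],
    )
    return best_label if best_sim > threshold else "unknown"


# Toy database with 2-dimensional descriptors, invented for the example.
database = [
    (pool_and_normalize([[1.0, 0.0], [1.0, 0.2]]), "fake"),
    (pool_and_normalize([[0.0, 1.0]]), "real"),
]
near_duplicate = pool_and_normalize([[1.0, 0.1]])
unrelated = pool_and_normalize([[1.0, 1.0]])
```

With this toy database, `near_duplicate` is matched to the fake entry, while `unrelated` stays below the 0.9 threshold for both entries and is labeled unknown.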



This last approach aims at combining the three preceding ones in a late fusion process. Thus, for a given image, it takes as input the predictions given by the three systems described above. As before, the final prediction on the image is then propagated to the tweets containing it.

Instead of using a simple fusion process (for instance, a majority vote), we try to automatically build a fusion model fine-tuned to the task. We thus use a machine learning algorithm, namely boosting (adaboost.MH) over decision trees [bonzaiboost], which takes as input the predictions of the three previous approaches, as well as the scores associated with these predictions (for run-T and run-I). The parameters of the machine learning algorithm are set by cross-validation on the training data: the number of boosting iterations is 500 and the depth of the trees is 3. Finally, the fusion model is learned on the whole training set; it is then used on the test set images.
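To make the contrast concrete, the sketch below shows the simple majority-vote baseline mentioned above, together with one possible way of encoding the inputs of a learned fusion model as a numeric vector. It does not implement boosting. The numeric encoding, the tie-breaking toward fake, and the function names are assumptions made for the example.

```python
from collections import Counter


def majority_vote(predictions):
    """Simple fusion baseline: most frequent label among non-unknown predictions."""
    votes = Counter(p for p in predictions if p != "unknown")
    if not votes:
        return "unknown"
    (top, top_count), *rest = votes.most_common()
    if rest and rest[0][1] == top_count:
        return "fake"  # arbitrary tie-break (assumption)
    return top


def fusion_features(pred_t, score_t, pred_s, pred_i, score_i):
    """Encode the three predictions and the two available scores as a flat vector.

    run-S provides no score, so only its label is encoded.
    """
    encode = {"fake": -1.0, "unknown": 0.0, "real": 1.0}
    return [encode[pred_t], score_t, encode[pred_s], encode[pred_i], score_i]
```

A vector such as `fusion_features("fake", 12.5, "real", "unknown", 0.4)` is the kind of input a boosted tree model could be trained on, one vector per image.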






The four approaches are applied on the MediaEval 2016 test set and results are reported above. The test set is composed of 2228 Twitter messages associated with 130 images. Moreover, 65% and 26% of the tweets of the development and test sets, respectively, are associated with a single event. We observe that the approach based on the source trustworthiness level (run-S) outperforms the text-based approach (run-T), which outperforms the image-based approach (run-I). We can see that the text-based approach competes with the source-based approach in terms of recall. This means that the text approach tends to classify every tweet as fake. This may be explained by the fact that the training set is unbalanced, as it contains three times more fake examples than real ones.

We note that the prediction based on the image approach has several drawbacks and performs poorly. In particular, the precision is low compared to what we estimated on the training set. Several explanations can be given. First, only 86% of the test tweets are associated with one or more images (the rest are associated with video content), meaning that the image approach is evaluated only on this portion of the dataset. Therefore, recall and F-score are directly impacted. Secondly, the reference database that we built is small and unbalanced, resulting in a high number of unknown labels in the predictions. Thirdly, the database does not always contain the original images, and a forged image that differs only slightly from its original version can be considered similar to it. Finally, images shared on SN often present specific editing characteristics, such as visibly added watermarks like fake, rumor or real, circles, text annotations, etc. Such edits impair the similarity computation between images.
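The effect of unknown labels on recall and F-score can be illustrated with a small computation. The sketch assumes that fake is the positive class and that an unknown prediction on a fake tweet counts as a miss; the data below is a toy example, not taken from the experiments.

```python
def precision_recall_f1(gold, predicted, positive="fake"):
    """Precision, recall and F1 for the positive class.

    Any prediction other than the positive label (including "unknown")
    on a positive example counts as a false negative.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: two of the three fake tweets receive no usable prediction.
gold = ["fake", "fake", "fake", "real"]
predicted = ["fake", "unknown", "unknown", "real"]
precision, recall, f1 = precision_recall_f1(gold, predicted)
```

In this toy case every fake prediction is correct, yet recall drops to one third because two fake tweets are left unlabeled, which in turn lowers the F-score.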

Concerning run-C, we note that the combination using late fusion does not offer any gain, and even performs worse than run-S alone. This result is disappointing, as it differs from what we evaluated on the training set by cross-validation. It may be explained by an overfitting problem when learning the fusion model, and by the lower precision (compared to the one estimated on the training set) obtained by run-I, which is used as input.


