Improvement of natural image search engines results by emotional filtering

In the Internet 2.0 era, managing user emotions is a problem that interests more and more actors. Historically, the first notions of emotion sharing were expressed and defined with emoticons. They allowed users to show their emotional status to others in an impersonal and emotionless digital world. Now, in the Internet of social media, users share large amounts of content with each other every day on Facebook, Twitter, Google+ and so on. Several popular web sites such as Flickr, Picasa, Pinterest, Instagram or DeviantArt are now specifically based on sharing image content as well as personal emotional status. This kind of information is economically very valuable, as it can for instance help commercial companies sell more efficiently. In fact, with this kind of emotional information, companies can better target their customers' needs and/or sell them more products. Research has long been, and still is, interested in mining emotional information from user data. In this paper, we focus on the emotional impact of images collected from image search engines. More specifically, we propose a filtering layer applied to the results of such image search engines. To our knowledge, this is the first attempt to filter image search engine results with an emotional filtering approach.

Received on 14 April, 2015; accepted on 24 January, 2016; published on 25 April, 2016


Introduction
The needs of Web users are evolving with new technologies. Nowadays, people are accustomed to participating and giving their opinions on websites or social networks. For example, every day they share emoticons, likes on Facebook, +1's on Google+ and short messages on Twitter. They even share their professional network connections on LinkedIn or Viadeo. The Internet of Things is a technology that now allows interaction with computers not only through keyboard and mouse but also through new user-oriented devices such as connected watches, pedometers, refrigerators, electronic scales or Google Glass.
Users of Web search engines are often lost in the huge number of answers they get from anonymous requests, and want specific and personalized replies to their own needs. The Semantic Web, sometimes called Web 3.0, is evolving to focus more and more on structuring its data in order to answer these users more precisely and personally.
In this context, the emotional status of users is now taken into account by recent works. Industry is interested because companies think they could enlarge their customer base by delivering better customized products that arouse positive emotions. This paper contributes to this research by introducing a new layer for image search engine filters that lets them take the emotional status of an image into account. We do not know of any search engine that has addressed these issues the way we have.
We first introduce in Section 2 background on image search engines (2.1). Emotion definitions and modelling are given in 2.2. We also discuss in Section 2.2 how emotions have been extracted from digital documents in the past. More details are given specifically on texts and images. Section 3 is devoted to the description of our proposed system. We first present Support Vector Machines (SVMs), used for our classifiers. Then an image database, called SENSE, is presented and we justify its use to train our system. Next, we present the bottom-up saliency model we used to focus on the most important emotional information of each image. Then we detail the whole processing chain of our system. We show some of its results on our database of cooking recipe images collected from the Internet and compare it to other systems. The last section (4) is kept for conclusions and perspectives.

Image search engines filtering
Image search engines often allow filtering of their results. This is usually done by extraction and comparison of image features. These can be obtained with bottom-up or top-down approaches [15]. Bottom-up methods extract low-level features from image data, while top-down methods are task-driven and use supervised learning. Common filtering techniques get information by direct access to the image file metadata. Examples of bottom-up features are the size; the aspect ratio (height, width); and the style (photo, drawing). Classical top-down features extracted from metadata include characters, faces, and portraits with head and shoulders. The number of indexed images from which features were extracted is also a criterion for comparing search engine filters. Table 1 shows a comparison of common filters.
Here is a non-exhaustive list of several more or less well-known image search engines: Google Images; Bing Images (Microsoft); TinEye; Picsearch; oSkope visual search; Search22.
The following section is about emotion modelling and extraction within digital documents.

Emotion and digital images
Human emotions are responses of human beings to their environment. This environment is interpreted by their brains through several sensors: eyes, ears, nose, mouth, skin, etc. People use their emotions to interact with other humans. They rely on their empathy to interpret these emotions correctly.
Emotion modelling. Many definitions of emotion have been published, and they differ between schools of psychology. With regard to computational approaches to emotion recognition, two theories are considered [25]:
• Basic emotion theories [39,45], which establish the existence of basic emotions. These basic emotions are also referred to as fundamental or primary.
• Theories of evaluation, which define emotion as a set of appraisal states that occur when a human being is faced with an external stimulus.
These two theories have defined the two most used emotion classifications:
1. Discrete approach [10,37]: the emotional process can be explained with a set of basic or fundamental emotions, innate and common to all humans (sadness, anger, happiness, disgust, fear, etc.). There is no consensus about the nature and number of these fundamental emotions. This modelling is usually preferred in emotion extraction based on facial expressions. An example of discrete classification of emotions is given in Figure 2(a); it represents Plutchik's wheel of emotions [37]. The author defines eight basic emotions; combinations of primary emotions define secondary ones.
2. Dimensional approach: emotions are considered as the result of a fixed number of concepts represented in a dimensional space. The dimensions can be pleasure, arousal or power, for instance. They vary depending on the needs of the model. Russell's dimensional model, represented in Figure 2(b), is the most used, with the dimensions valence and arousal:
• The valence corresponds to the way a person feels when looking at a picture. This dimension varies from negative to positive and makes it possible to distinguish between unpleasant emotions and pleasant ones.
• The arousal represents the activation level of the human body.
The advantage of these models is that they define a large number of emotions. Despite this, some emotions can be confused (such as fear and anger in Russell's circumplex) or unrepresented (surprise, among others, in Russell's model).
Emotion extraction. As a reaction to our environment, emotions are linked to our five senses. However, only sight and hearing are used in our digital world of multimedia documents. Much research has been carried out to achieve emotion extraction by computers and improve interaction between humans and machines. The holy grail is to give computers a human-like capacity for empathy. However, the multiplicity and diversity of digital documents still makes it a major research challenge for computers to correctly gather and interpret emotions. Moreover, extraction of emotions can be targeted either at the content or at the user of the document. In other words, the goal may be to extract emotions intrinsically embedded in a document or those that it will arouse in the humans reading it.
From audio. To decode emotions from audio files, the focus is on our hearing abilities. It is well known that music or voice provokes emotions in people. For instance, one can often tell the mood of another person by listening to her, even in a foreign language. Competitive challenges such as avec or iscaspeech attract new contributors every year [22,31]. Table 2 summarizes these emotion extraction approaches on audio documents.
The majority of digital documents are perceived through our eyes. Most contributions in the literature therefore focus on visual documents, as we do in the following.
From text documents. The first documents scrutinized to extract emotions were digital textual documents. Text is linked to language. If computers could interpret texts, they would be able to interpret the emotions we express in them. Textual information is still mined to decode human emotions. For instance, Salway and Graham used audio description text streams to extract characters' emotions in movies [41]. Kamvar and Harris extracted emotional information from the textual content of bloggers all around the Internet [20].

With social media, the need to express emotions gave rise to emoticons. They were historically defined by small groups of characters, usually two or three, representing a facial expression (for example :-) :-| :-( (^_^) and (;_;)). Now they are encoded as single characters. Emoticons have been used to extract emotions from text documents [38,46]. Hu et al. defined different emotional signals extracted from social media datasets (Twitter and Facebook short messages) to build an unsupervised sentiment analysis system [17]. Lin et al. proposed an emotion classification system based on five textual feature types [24]. They pointed out that the points of view of authors and observers on emotions are often not the same, as in [21].
Table 3 summarizes these emotion extraction approaches on textual documents.
The following paragraph focuses on emotion extraction from image documents.
From image documents. Image emotion extraction is a challenge because researchers are still learning how to link computer language and human interpretations of images. The first approaches to computing image emotions looked for associated text. That way, previous work on text (as in Section 2.2) helped extract emotions from images.
Emotion search engines or interfaces based on textual metadata and user tagging were proposed in [11,18,21,52].
A large part of the literature is devoted to the links between emotions and colours [2,6,7,27,32-34,49]. In fact, there is a consensus that a link exists between colours and particular emotions. As stated by Ou et al. [32], colours play an important role in decision-making, evoking different emotional feelings. In a series of publications, Ou et al. [32-34] studied the relationship between emotions, preferences and colours. They established a model of emotions associated with colours from psychophysical experiments.
Emotions are also extracted from facial features (such as eyebrows and lips) using faces contained in images [35]. It seems to be the easiest way to predict emotions, since facial expressions are a common way for humans to express basic emotional feelings (happiness, fear, sadness).
More recently, some authors have treated emotion recognition as a content-based image retrieval (CBIR) task [28,44,50]. The underlying idea is to use the traditional techniques of image retrieval to extract the emotional impact. Extraction of traditional image features (colours, textures, shapes) combined with a classification system allows the authors to predict the emotional impact of images after a learning step. It is often necessary to add complementary information. For example, Wang and Yu [48] used the semantic description of colours to associate an emotional semantic with an image. Liu et al. [25] stated that oblique lines could be associated with dynamism and action, and horizontal and vertical ones with calm and relaxation.
Although research has been carried out to characterize the emotions of still images with low-level feature extraction, as said above, we do not know of any emotional image search engine focused only on low-level extraction from images, as developed in this paper. However, some studies have been made on image search engines.
Table 4 summarizes these emotion extraction approaches on image documents.
Most of the papers presented above use text metadata mining to extract emotional information about images in their search engines. A novelty of our paper is that we do not use associated emotional text or, like Solli, emotional Bags of Emotions [43]; instead, we extract emotion from images using bags of visual words [4] based on low-level descriptors computed at the pixel level.

Proposed system
Our proposed method starts with a Query By Image Content (QBIC) and defines an emotional filter system based on low-level image features. This filter can then be applied to image search engine results in Web technologies, for instance. For that, we compute an emotional score for each image in a set acquired by QBIC. This score is obtained by fusing five classifier scores. The classifiers are based on Support Vector Machines (SVMs) [9], described in Subsection 3.1, and powered by libSVM [8]. The classifiers are trained on a low-semantic database called SENSE, which is described in Subsection 3.2. The description and justification of the use of a saliency model can be read in Subsection 3.3. Finally, the whole pipeline of our emotional filtering system is described in detail in the last Subsection 3.4.

Support Vector Machines
Support Vector Machines (SVMs), introduced as Support Vector Networks in [9], are supervised learning models used for classification and regression analysis.
We briefly introduce the basics of SVMs to explain how we used them in our approach. To perform classification, one first has to gather data. A model is learnt by splitting the data into training and testing sets. Each sample of the training set is described by a feature vector and labeled with the class it belongs to. The goal of SVMs is to produce a model, based on the training set, which predicts the target class or label of every sample of the test set given only its attributes (feature vector). To achieve that, samples projected into a feature space are separated by a linear hyperplane.
The SVM classification process is illustrated in Figure 3. Let $(x_i, y_i)$, $i = 1, \ldots, m$, be a training set of $m$ sample-label pairs, where $x_i \in \mathbb{R}^n$, $n$ is the dimension of the feature vector and $y \in \{1, -1\}^m$. In both sketches, the separating hyperplane, defined by the equation $w \cdot x + b = 0$, is represented by the black line. For every sample, $r = \frac{|w \cdot x + b|}{\|w\|}$ is the distance to the separating hyperplane. The support vectors (rounded points) are the closest points to the separating hyperplane. $d$ is the distance between the two classes (red and blue). It is also called the margin. The goal of SVMs is to maximize this distance. In the case of hard-margin SVMs (Figure 3.a), the SVM finds a linear separating hyperplane with the maximal margin. When using a soft margin (Figure 3.b), an error term $\xi$ is introduced to take classification errors and noise into account. $C > 0$ is the penalty parameter. It is a constant used to control the balance between the number of classification errors and the margin width.
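As a small illustration of the quantities defined above, the following sketch computes the decision value $w \cdot x + b$, the distance $r$ of a sample to the separating hyperplane and the predicted label. The hyperplane and the samples are hypothetical values chosen for readability; they are not taken from the trained classifiers of this paper.

```python
import math


def decision_value(w, b, x):
    """Return w.x + b for a single sample x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def distance_to_hyperplane(w, b, x):
    """Return r = |w.x + b| / ||w||, the distance from x to the hyperplane."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm_w


def predict(w, b, x):
    """Return the predicted label (+1 or -1) from the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


# Hypothetical hyperplane and samples, for illustration only.
w, b = [3.0, 4.0], -5.0
samples = [[3.0, 4.0], [0.0, 0.0]]
for x in samples:
    print(x, predict(w, b, x), distance_to_hyperplane(w, b, x))
```

In a trained linear SVM, $w$ and $b$ would come from the optimization step rather than being fixed by hand as they are here.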
The support vector machines (SVMs) [5,9] require the solution of an optimization problem in which the training vectors $x_i$ are mapped into a higher (possibly infinite) dimensional space by the function $\Phi$. Furthermore, $k(x_i, x_j) \equiv \Phi(x_i)^T \Phi(x_j)$ is called the kernel function. Although new kernels are regularly proposed by researchers, a few are best known and most commonly used in the literature, the linear kernel among them. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, which implicitly maps their inputs into high-dimensional feature spaces. This property of SVMs is not described further here, as we used linear classification.
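For reference, the optimization problem mentioned above is usually written in the SVM literature in the following soft-margin form, using the symbols already introduced ($w$, $b$, $\xi$, $C$, $\Phi$). It is given here as the standard textbook formulation, on the assumption that this is the form the text refers to; the linear kernel shown last is the one relevant to the linear classification used in this work.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \frac{1}{2}\, w^{T} w + C \sum_{i=1}^{m} \xi_i \\
\text{subject to}\quad & y_i \left( w^{T} \Phi(x_i) + b \right) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \qquad i = 1, \ldots, m, \\[4pt]
\text{linear kernel:}\quad & k(x_i, x_j) = x_i^{T} x_j .
\end{aligned}
```

A larger $C$ penalizes classification errors more heavily and yields a narrower margin, while a smaller $C$ tolerates more errors in exchange for a wider margin.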

SENSE database
Several image databases have been used to study emotions [28,44,50]. The best known and most used is the International Affective Picture System (IAPS) [23], from the Center for the Study of Emotion and Attention (CSEA) at the University of Florida. As there was no universal emotional image database focused on the extraction of emotional impact, Gbèhounou et al. [13] proposed one. This image set is named SENSE, for Studies of Emotion on Natural image databaSE. It is composed of images that are as diversified as possible and have low semantic power. That is to say, the images do not shock or force a strong emotional response during subjective evaluation, in contrast to some IAPS images for example. Each image was evaluated by users during a very short period of time: eight seconds. This makes it possible to capture the primary emotional impact of the presented images. In fact, Schyns et al. demonstrated that emotions in facial expressions [42] are transmitted during the very first 200 ms after the observer receives the signal. During this very short period of time, semantic interpretation cannot be done by the user. In contrast, images from the IAPS embed high semantic power. High semantic content can introduce a bias in the evaluation of multiple images, as strong emotions remain longer than weak emotions in the mind of evaluators [40]. The authors took care to include only low-semantic images in their database in order to minimize or remove that kind of bias during evaluation. The goal was to evaluate each image for its content and not for the remnant of the emotion evaluated on the previous image. The database is free to use for research and contains 350 images of landscapes, characters, animals, food, drink, and historic and tourist buildings, as shown in Figure 4. Furthermore, the database contains very few images with human faces (4.86%). This last point reflects the desire to limit the interpretation of facial expressions, which trigger stronger emotional reactions.
The SENSE database was then evaluated in different subjective experiments. Twenty-five subjects participated in the experiments, 28% women and 72% men. Half were aged 18 to 24 years. They took part on a voluntary basis and did not receive a financial reward. The test strategy is closely linked to the model of emotions the authors chose. They decided to work with a dimensional model whose dimensions are:
• the nature of the emotion,
• the power of the emotion.
This emotion modelling is equivalent to the Valence/Arousal model, in which the valence distinguishes positive and negative emotions and the arousal, which varies from low to high, describes the activation of the body. The power parameter describes the intensity of the emotion associated with the chosen nature of the emotion. During the different tests, observers evaluated the nature and the power of the emotion aroused by the image. As shown in Figure 5, for the nature of the emotion, the participants had a choice between "Negative", "Neutral" and "Positive". The power was chosen from low to high. During the tests, some of the presented images had the same content but were modified by transformations: rotations and changes in colour balance. Even if they change the natural aspect of some pictures, these transformations made it possible to assess the invariance of some of the features used to recognise the emotion associated with the image. Only 2.29% of the images represent this transformed, "non-natural" content.
The next subsection will describe why we chose to process our features extraction algorithms on salient thumbnails instead of full images.

Bottom-up saliency model
During the emotional evaluation of the SENSE images, the authors realized that a number of pictures were not clearly classified by all users. This result is not surprising given how the image database was built. In fact, the images in the database were selected so that they do not cause strong emotional responses in observers. Therefore it seems logical that a number of them are not easy to classify emotionally. A new experiment was then set up: evaluating the most salient areas of each SENSE image. Might focusing on the most salient zones allow better emotional classification? This experiment was carried out in [13], resulting in a new SENSE2 database. For each SENSE image, its most salient region counterpart was added to SENSE2. The analysis of SENSE2 showed a correlation between the emotion evaluations of each SENSE image and of its salient counterpart in SENSE2.
In our proposed system, as we will see in Section 3.4, we extract emotions from images with CBIR techniques. We designed our system to work on low-semantic images, as we think that the images returned by image search engines will not have a strong emotional impact on users. This is why we chose to compute features from image thumbnails obtained with a saliency model. The saliency model we use was developed in our laboratory from previous work by Perreira et al. [36] and is based on a computational model of attention.
This was implemented to improve the performance of our system:
• working only on smaller images reduces processing time;
• focusing on the most significant regions yields better emotional classification.
Many researchers have worked on visual attention; one can refer to [3] for a good review. A classically used model of visual attention is Laurent Itti's [19]. The first part of its architecture relies on the extraction of three conspicuity maps based on the computation of low-level characteristics, which corresponds to the production of information on the retina. These three conspicuity maps are color, intensity and orientation.
The second part of Itti's architecture proposes a medium-level system which merges the conspicuity maps and then simulates a visual attention path over the observed scene. The focus is determined by "winner-takes-all" and "inhibition of return" algorithms (Figure 6).
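The interplay between "winner-takes-all" and "inhibition of return" can be sketched in a few lines. This is a deliberately simplified illustration on a small 2D saliency map, not Itti's neural implementation: the function name `attention_path`, the square inhibition neighbourhood and the toy map are all assumptions made for the example.

```python
def attention_path(saliency, n_fixations, radius):
    """Return successive foci (row, col) on a 2D saliency map.

    Winner-takes-all: the next focus is the current global maximum.
    Inhibition of return: the neighbourhood of each visited focus is
    zeroed so that attention moves on to the next most salient region.
    """
    grid = [row[:] for row in saliency]  # work on a copy
    height, width = len(grid), len(grid[0])
    path = []
    for _ in range(n_fixations):
        r0, c0 = max(
            ((r, c) for r in range(height) for c in range(width)),
            key=lambda rc: grid[rc[0]][rc[1]],
        )
        path.append((r0, c0))
        # Suppress a square window around the winner (clipped at the borders).
        for r in range(max(0, r0 - radius), min(height, r0 + radius + 1)):
            for c in range(max(0, c0 - radius), min(width, c0 + radius + 1)):
                grid[r][c] = 0.0
    return path


# Hypothetical merged saliency map, for illustration only.
toy_map = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.0],
    [0.1, 0.3, 0.1, 0.6],
    [0.0, 0.0, 0.5, 0.7],
]
print(attention_path(toy_map, n_fixations=2, radius=1))
```

On the toy map, the first focus is the strongest peak; once its neighbourhood is inhibited, the second focus moves to the remaining peak in the opposite corner.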
Perreira et al. proposed to substitute this second part by their optimal competitive architecture: a preys / predators system. Preys / predators equations are particularly well adapted to such a task:
• preys / predators systems are dynamic; they intrinsically include the time evolution of their activities. Thus the visual focus of attention, seen as a predator, can evolve dynamically;
• without any objective (top-down information or salience), choosing a method for fusing conspicuity maps is hard. A solution consists in setting up a competition between conspicuity maps and waiting for a natural balance in the preys / predators system, reflecting the competition between the emergence and inhibition of elements that do or do not engage our attention;
• discrete dynamic systems can have chaotic behavior. Although this property is not often of interest, it is an important one here. Actually, it allows the …
The general architecture of the Perreira model is represented in Figure 7. Starting from the "basic" version of the preys / predators equations, Perreira et al. enriched the processing in several ways:
• the number of parameters can be reduced by replacing $s'$ by $s$. Indeed, differences in mortality rates between preys and predators can be modeled by adjusting the factors $b$ and $m_I$;
• the original model represents the evolution of a single quantity of preys and predators over time. It can be spatially extended in order to be applied to 2D maps, where each point represents the amount of preys or predators at a given place and time. Preys and predators can then "move" on this map using a classical diffusion rule, proportional to their Laplacian $\Delta_C$ and a diffusion factor $f$;
• natural mortality of preys in the absence of predation is not taken into account. If the model only changes temporally, mortality is negligible compared to predation. However, when the model is applied to a 2D map (which is the case in our system), some areas of the map may not contain any predator. Natural mortality of preys can then no longer be considered negligible. A new mortality term $-m_C$ needs to be added to the model.
This yields a set of equations modeling the evolution of prey and predator populations on a two-dimensional map. A last phenomenon can be added to this model: a positive feedback, proportional to $C^2$ or $I^2$ and controlled by a factor $w$. This feedback models the fact that (provided unlimited resources) the more numerous a population is, the better it is able to grow (more efficient hunting, higher encounter rate favoring reproduction, etc.). The final preys / predators system includes this feedback term.
In order to simulate the evolution of the focus of attention, Perreira et al. proposed a preys / predators system (as described above) with the following features:
• the system comprises four types of preys and one type of predator;
• these four types of preys represent the spatial distribution of the curiosity generated by four types of conspicuity maps;
• the predators represent the interest generated by the consumption of curiosity (preys) associated with the different conspicuity maps;
• the global maximum of the predators map (interest) represents the focus of attention at time $t$.
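The two-dimensional system described above, before the positive feedback is added, can be sketched as follows. This is a reconstruction from the textual description (diffusion factor $f$, mortality terms $m_C$ and $m_I$, predation factor $s$) and should not be read as a verbatim copy of the authors' equations. Two points are assumptions: $b$ is read as the growth factor of the preys, and $P_{x,y}$, which is not defined in the text, is taken to denote the diffusion term of the predators.

```latex
\begin{aligned}
\frac{dC_{x,y}}{dt} &= b\,C_{x,y} + f\,\Delta_{C_{x,y}} - m_C\,C_{x,y} - s\,C_{x,y} I_{x,y} \\
\frac{dI_{x,y}}{dt} &= s\,C_{x,y} I_{x,y} + s f\,P_{x,y} - m_I\,I_{x,y}
\end{aligned}
```

Here $C_{x,y}$ and $I_{x,y}$ are the quantities of preys (curiosity) and predators (interest) at position $(x, y)$. The term $-s\,C_{x,y} I_{x,y}$ removes preys through predation, and the matching term $+s\,C_{x,y} I_{x,y}$ feeds the predators.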
The authors showed that despite the non-deterministic behavior of preys / predators equations, the system exhibits interesting properties of stability, reproducibility and reactiveness while allowing a fast and efficient exploration of the scene.
With this visual saliency model, we can focus our emotion extraction on the minimal local region of an image where its emotional impact should be the most meaningful. In other words, this enables us to summarize the emotion of every image by its most salient thumbnail counterpart, and it also reduces feature extraction time, as thumbnail dimensions are smaller.
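To make the idea of a salient thumbnail concrete, the sketch below crops a window centred on the global maximum of an interest map. The window shape and the `half_size` parameter are hypothetical choices made for the example; the text does not specify how the thumbnail boundaries are actually chosen.

```python
def focus_of_attention(interest):
    """Return (row, col) of the global maximum of a 2D interest map."""
    height, width = len(interest), len(interest[0])
    return max(
        ((r, c) for r in range(height) for c in range(width)),
        key=lambda rc: interest[rc[0]][rc[1]],
    )


def salient_thumbnail(image, interest, half_size):
    """Crop a square window centred on the focus, clipped at the image borders."""
    r0, c0 = focus_of_attention(interest)
    top, bottom = max(0, r0 - half_size), min(len(image), r0 + half_size + 1)
    left, right = max(0, c0 - half_size), min(len(image[0]), c0 + half_size + 1)
    return [row[left:right] for row in image[top:bottom]]


# Hypothetical 4x4 grey-level image and interest map, for illustration only.
image = [[4 * r + c for c in range(4)] for r in range(4)]
interest = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.0, 0.0],
    [0.0, 0.8, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(salient_thumbnail(image, interest, half_size=1))
```

The thumbnail has at most `(2 * half_size + 1)` rows and columns, so every later feature extraction step processes far fewer pixels than the full image.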
The next subsection is the core of our system: it details how we constructed our emotional filtering structure layer by layer.

Emotional filtering system structure
Our emotional filter is based on an emotional score computed with a machine learning approach. The idea is to compute a signature for a query image and to feed it to trained classifiers in order to output an emotional score. The global system is illustrated in Figure 1 and Figure 8.
First, we collect an input set of images through a textual or image query on any image search engine. Then we extract the main salient region of each image in the set using the computational model of attention [36] described in Section 3.3. This additional processing is justified by the fact that the emotional impact of an image is correlated to its main salient region [13]. Moreover, as thumbnails, salient regions are smaller and thus take less processing time.
Next, each salient image (thumbnail) is processed by a Harris-Laplace key-point detector [16,29]. A dynamic algorithm ensures that a minimum number of key-points is detected and described: the size of the thumbnail to process is enlarged when necessary until enough key-points are detected.
Then a k-means algorithm is applied to reduce the dimensionality of the feature vectors. A CBIR Bag of Words (BoW) method [4] is used to produce visual words from these feature vectors, as was done for the training of the emotional classifiers in [12]. The training process is illustrated in Figure 9. Note that the emotion classifiers are trained offline on the SENSE database.
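The Bag of Words step can be illustrated with a minimal sketch: each local descriptor is assigned to its nearest visual word (a k-means cluster centre), and the image signature is the histogram of these assignments. The vocabulary and descriptors below are hypothetical two-dimensional values, and normalising the histogram is an assumption made for the example; the actual descriptors and vocabulary size follow [12].

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the visual word (cluster centre) closest to the descriptor."""
    return min(
        range(len(vocabulary)),
        key=lambda k: squared_distance(descriptor, vocabulary[k]),
    )


def bow_signature(descriptors, vocabulary):
    """Normalised histogram of visual-word occurrences for one image."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts)
    if total == 0:  # no key-point detected
        return [0.0] * len(vocabulary)
    return [count / total for count in counts]


# Hypothetical vocabulary (k-means centres) and key-point descriptors.
vocabulary = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 1.1], [1.0, 0.8], [0.1, 0.9]]
print(bow_signature(descriptors, vocabulary))
```

The resulting signature has a fixed length equal to the vocabulary size, whatever the number of key-points detected in the thumbnail, which is what allows it to be fed to an SVM classifier.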
We then classify each image with five classifiers, built from the same training sets used with the SENSE database in [12], using Support Vector Machine classifiers (SVMs) implemented with the libSVM library with linear kernels [8]. Each classifier outputs a real score between -1 and +1: -1 stands for a negative emotional impact whereas +1 stands for a positive emotional impact. All the scores are merged by simple addition into a final emotional score. This is illustrated in Figure 10.
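The fusion and filtering steps can be sketched as follows. The additive fusion of five scores in [-1, +1] follows the description above; the file names, the score values and the `threshold` parameter used to decide which images are kept are hypothetical and only serve the example.

```python
def emotional_score(classifier_scores):
    """Fuse per-classifier scores (each in [-1, +1]) by simple addition."""
    for score in classifier_scores:
        if not -1.0 <= score <= 1.0:
            raise ValueError("classifier scores must lie in [-1, +1]")
    return sum(classifier_scores)


def emotional_filter(scored_images, threshold):
    """Keep images whose fused score reaches the threshold, best first."""
    fused = [(name, emotional_score(scores)) for name, scores in scored_images.items()]
    kept = [(name, score) for name, score in fused if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical outputs of the five classifiers for three images.
results = {
    "dish_a.jpg": [0.5, 0.25, 0.75, 0.5, 0.0],
    "dish_b.jpg": [-0.5, -0.25, 0.0, 0.25, -0.5],
    "dish_c.jpg": [1.0, 0.5, 0.5, 1.0, 0.5],
}
print(emotional_filter(results, threshold=0.0))
```

With five classifiers the fused score ranges from -5 to +5, so a threshold of 0 keeps the images judged positive overall and ranks them from the most to the least positive.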

Results and discussions
A database of 100,000 images of cooking recipes from the Internet was created by a partner lab in our project. Figure 11 shows examples of its content. It was a good opportunity to test our model on a large scale. This database contains a large diversity in the shapes and looks of the displayed dishes. We therefore presumed that its content does not have a high emotional impact on viewers. This is why we chose to use the SENSE database to train our application.
Our system can output different views. First, images can be displayed in order of computed emotional score; Figure 8 shows some results of our system on this cooking database. We can see how our system filters images, but the result still lacks any emotional validation. Again, the main novelty of our proposal was to apply, for the first time, an emotional filtering layer to the results of image search engines. The validation process will be investigated in future work. Another contribution was to use low-level features from the most salient regions only to describe image emotions.
In [14], the authors used a hybrid approach to define a notion of attractiveness in web image search. This is one of the closest works to our proposed system that we could find. They used EXIF metadata together with trained high-level visual features (quality assessment, aesthetic prediction, affective classification) to define their attractiveness ranks. They also acknowledge that their results can still be widely improved. To improve them, they add the images' EXIF textual metadata to their model. They also look into the images' source web pages for other textual information to help the classification process.
Our approach here is different: we want to extract the emotional information from the image content only. We do not rely on additional information to enhance classification performance, as is usually done in other approaches. Also, our system scores image sets, whereas theirs scores the Web pages where images can be found.

Conclusions and Perspectives
In this paper, we focused on the emotional impact of images collected from image search engines. We proposed, for the first time, to add an emotional filtering system to image search engine results.
This new filtering technique is intended to enhance the experience of web users on image searches.
We saw that our solution still needs a validation phase. We thought of several validation schemes centred on human evaluation:
• a standalone application where evaluators validate the engine's classification. Image search engine result sets will be labeled by people from an emotional perspective. Labeled ground-truth samples can then be compared with the system's results;
• evaluators could compare their emotional feelings when viewing raw search engine result sets with those when viewing emotionally filtered ones;
• users' needs for individual content may be met using backpropagation labelling to train classifiers on personalized emotional result sets;
• a user interface for re-ranking (drag and drop) images ordered by the emotional filtering system, where the users' preferred ranking positions are taken as new labeled training sets;
• enhancement of a user's personalized results by deletion of images from his displayed results. This will again help the classifier training with backpropagation techniques.
New technologies such as eye-tracking can help analyse user behaviour when observing results from search engines. We have already started experiments in our lab. Given two displayed images, users are asked to focus on their preferred one. The analysis of eye-tracking data allows the computer to determine the preferred one. Future work may reveal eye-tracking path patterns linked to emotional impact factors when looking at images.
Big data technologies will also be necessary to deliver customizable results to users. Behavior patterns are already extracted from users' navigation over the Internet. Merging emotional information extracted from the Internet with emotional image descriptors may also allow large improvements in the efficiency of image search engine filters.
If all of this were implemented, the Internet could eventually be focused on users and not only on data.

Figure 2. Examples of discrete and dimensional emotion classification.

Figure 3. SVMs' principle: The two classes (red and blue points) are separated by the SVM hyperplane (black line). Support vectors (rounded points) are the closest samples to the hyperplane.

Figure 6. Architecture of the Perreira computational model of attention.
$\ldots - s\,C_{x,y} I_{x,y}, \qquad \frac{dI_{x,y}}{dt} = s\,C_{x,y} I_{x,y} + s f\,P_{x,y} - m_I\,I_{x,y}$

Figure 7. Competitive preys / predators attention model. Singularity maps are the resources that feed a set of preys, which are themselves eaten by predators. The maximum of the predators map represents the location of the current focus of attention.

Figure 8. Emotional filtering model. In this example, only high emotional score images are displayed as a result of the filter.

Figure 11. Image examples from our cooking recipes database.

Table 1. Comparison of common filters proposed by image search engines
Filter: date | size | orientation | color | file format | licence | style | faces | image query | textual query
2. Dimensional approach: emotions are considered as the result of a fixed number of concepts represented in a dimensional space. The dimensions can be pleasure, arousal or power, for instance. They vary depending on the needs of the model. Russell's dimensional model, represented in the Figure

Table 2. Extraction approaches on audio documents

Table 3. Emotion extraction approaches on textual documents
EAI European Alliance for Innovation, EAI Endorsed Transactions on Creative Technologies, 10 2015 - 04 2016 | Volume 3 | Issue 6 | e4

Table 4. Emotion extraction approaches on image documents
Subsection 3.2. The description and justification of the usage of a saliency model can be read in Subsection 3.3.