Detecting emotions in selfies with Azure Cognitive Services

Using Azure Cognitive Services to perform emotion analysis on selfies.

Issam Ouchen
Published in Slalom Technology
8 min read · May 14, 2018


In the age of Artificial Intelligence (AI), humans are increasingly relying on communication with computer systems to accomplish various tasks. In order for this communication to be effective, computers will need to “understand” human emotion. Understanding whether an individual has a positive or negative state of mind in a given moment could make a huge difference in the quality of that human-machine interaction. In this human-machine communication, facial expressions play a very important role. The human face is the richest source of emotions. Just imagine a world where you can convert your selfie into an emoji based on the emotions displayed in your facial expression, or use that selfie to leave a restaurant review that captures perfectly how you felt at the end of your meal there. Today’s customers expect that apps are intelligent enough to know their intent and desire with less input and fewer taps.

AI is the field that studies algorithms capable of automatically learning from data and making predictions based on that data. In this field, Machine Learning (ML) and Deep Learning are two of the fastest growing and most exciting areas today. ML takes some of the core ideas of AI and focuses them on solving real-world problems with algorithms designed to mimic our own decision-making. Deep Learning, on the other hand, focuses even more narrowly on a subset of ML tools and techniques, and applies them to solving just about any problem which requires “thought,” human or artificial.

Machine Learning and Deep Learning have emerged as two of the most disruptive forces behind digital transformation, and they are revolutionizing the way we work and live. The exciting news is that developers can take advantage of multiple AI frameworks and APIs to explore the power of Machine Learning. These APIs put powerful machine learning features at your fingertips without requiring you to build or train any models of your own.

In this article, we will focus on the Microsoft Cognitive Services APIs, which offer easy-to-use models trained on vast repositories of data to address common use cases. Depending on the data type they analyze, these technologies are grouped into six categories:

  • Vision: Vision APIs analyze visual content (images and video), identifying objects and recognizing faces and emotions. Vision algorithms allow the implementation of face authentication in apps, and the creation of services that can group faces according to shared characteristics or guess the age of a person in a photo.
  • Speech: Speech APIs implement speech processing in apps: they convert speech to text and vice versa, translate text to other languages, and identify speakers. The technology can be used for hands-free tools to dictate text or to read instructions out loud.
  • Language: Language APIs analyze natural language and sentiment, and check spelling.
  • Knowledge: Knowledge APIs analyze data to discover relationships and patterns to complete tasks such as recommendations or query auto-completion.
  • Search: Search APIs are integrated with the Bing search engine and include: Bing Auto-suggest API, Bing News Search API, Bing Web Search API, etc.
  • Labs: Developers have a chance to use experimental technologies that are still under development. Developers who don’t need a market-ready technology can adopt these experimental techniques, try them, and provide feedback on new Microsoft cognitive computing services before they become generally available.

Source: Microsoft Azure

In an effort to address this human-machine communication effectiveness problem, I will demonstrate how we can leverage Cognitive Services’ Vision API to analyze the emotions displayed in a given selfie.

Detecting Emotions

Taking full advantage of Microsoft technology solutions, we will be using Microsoft’s mobile development framework, Xamarin, to create a mobile app. This app will allow the user to either take a selfie or choose an existing one. The selfie is then processed through the Vision API, which looks for emotions such as happiness, sadness, anger, and disgust, and gives each one a percentage score.
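The focus of this article is the emotion analysis rather than the camera plumbing, but for illustration, here is a minimal sketch of grabbing the photo in shared Xamarin code. It assumes the Xam.Plugin.Media package, which is just one common option and not necessarily what you would use; the class and method names around it are illustrative.

// Minimal sketch of capturing or picking a selfie in shared Xamarin code.
// Assumes the Xam.Plugin.Media NuGet package (Plugin.Media namespace);
// swap in whatever camera/gallery plugin your project actually uses.
using System.IO;
using System.Threading.Tasks;
using Plugin.Media;
using Plugin.Media.Abstractions;

public static class SelfiePicker
{
    // Returns the photo as a stream, or null if the user cancelled.
    public static async Task<Stream> GetSelfieAsync(bool takeNew)
    {
        await CrossMedia.Current.Initialize();

        MediaFile photo = takeNew && CrossMedia.Current.IsCameraAvailable
            ? await CrossMedia.Current.TakePhotoAsync(new StoreCameraMediaOptions
              {
                  DefaultCamera = CameraDevice.Front,   // use the selfie camera
                  PhotoSize = PhotoSize.Medium          // keep the upload comfortably small
              })
            : await CrossMedia.Current.PickPhotoAsync();

        return photo?.GetStream();
    }
}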

The finished app displays the selfie along with a percentage score for each emotion it detects.

Cognitive Services: AI as a Service

The Vision API is a core part of Microsoft Cognitive Services. It is a web service hosted by Microsoft, but what does that mean? Put simply, the Vision API is a service running inside Azure: you submit an image to it, and it returns a JSON object describing what it saw in that image.
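To make that round trip concrete, here is a minimal sketch of posting an image URL to the Vision API’s analyze endpoint and reading back the JSON. The “westus” region and the list of visual features are assumptions, and the class name is illustrative; substitute the endpoint and key from your own subscription.

// Sketch: submit an image URL to the Vision API and read back the JSON description.
// The "westus" region and the visualFeatures list are placeholders.
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class VisionQuickTest
{
    const string Endpoint =
        "https://westus.api.cognitive.microsoft.com/vision/v1.0/analyze?visualFeatures=Description,Tags,Faces";

    public static async Task<string> DescribeAsync(string imageUrl, string subscriptionKey)
    {
        using (var http = new HttpClient())
        {
            http.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);

            var body = new StringContent($"{{\"url\":\"{imageUrl}\"}}", Encoding.UTF8, "application/json");
            HttpResponseMessage response = await http.PostAsync(Endpoint, body);
            response.EnsureSuccessStatusCode();

            return await response.Content.ReadAsStringAsync();   // JSON describing the image
        }
    }
}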

There are three fundamental aspects that can be leveraged with the Vision API. These are:

  • Image classification and identification: This feature returns information about the visual content found in an image, grouped into two categories: content detection and content categorization. The Vision API returns tags based on more than 2,000 recognizable objects, such as people, scenery, and actions. In addition to tagging, it is also able to return taxonomy-based categories based on a list of 86 concepts, such as faces, food, nature, and abstract. It can also detect whether an image is black and white, a line drawing, a color picture, and so on.
  • Thumbnail generation: Generate a high-quality, storage-efficient thumbnail from any input image. Use thumbnail generation to modify images to best suit your needs for size, shape, and style. Apply smart cropping to generate thumbnails that differ from the aspect ratio of your original image, yet preserve the region of interest (a rough sketch of this call follows this list).
  • Text recognition: This refers mostly to optical character recognition (OCR). The technology identifies text content in an image and extracts it into a machine-readable string. It supports text recognition in 25 languages and can also correct text rotation in an image. In addition to OCR, which covers machine-printed text, the Vision API is also able to recognize handwritten text from notes, letters, essays, and forms.
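As promised above, here is a rough sketch of the thumbnail generation call. The endpoint region and the 200 x 200 size are placeholders, and the response body is the binary thumbnail itself rather than JSON.

// Sketch: ask the Vision API for a 200x200 smart-cropped thumbnail of an image URL.
// Endpoint region and sizes are placeholders; the response is the thumbnail bytes.
using System.IO;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class ThumbnailClient
{
    const string Endpoint =
        "https://westus.api.cognitive.microsoft.com/vision/v1.0/generateThumbnail" +
        "?width=200&height=200&smartCropping=true";

    public static async Task SaveThumbnailAsync(string imageUrl, string subscriptionKey, string outputPath)
    {
        using (var http = new HttpClient())
        {
            http.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);

            var body = new StringContent($"{{\"url\":\"{imageUrl}\"}}", Encoding.UTF8, "application/json");
            HttpResponseMessage response = await http.PostAsync(Endpoint, body);
            response.EnsureSuccessStatusCode();

            byte[] thumbnail = await response.Content.ReadAsByteArrayAsync();
            File.WriteAllBytes(outputPath, thumbnail);   // the cropped thumbnail image
        }
    }
}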

Each of these aspects provides a different set of features that add a great deal of capability for digital image processing. Just like any other API, there are some requirements for using it. Images can be submitted either as binary data or as an image URL, and must be in JPEG, PNG, GIF, or BMP format. The file size must be less than 4 MB, and the image dimensions must be larger than 50 x 50 pixels.
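These constraints are easy to check on the client before spending a network call. Here is a rough guard, assuming the selfie has already been saved to a local file path; the 50 x 50 pixel check needs a platform-specific image API in Xamarin, so it is only flagged in a comment.

// Sketch: pre-flight checks for the Vision API's input constraints.
// Checking pixel dimensions (must exceed 50 x 50) requires a platform image API
// in Xamarin, so only the format and 4 MB size limits are validated here.
using System.IO;
using System.Linq;

public static class ImagePreflight
{
    static readonly string[] AllowedExtensions = { ".jpg", ".jpeg", ".png", ".gif", ".bmp" };
    const long MaxBytes = 4 * 1024 * 1024;   // 4 MB limit

    public static bool IsAcceptable(string path, out string reason)
    {
        reason = null;

        if (!AllowedExtensions.Contains(Path.GetExtension(path).ToLowerInvariant()))
        {
            reason = "Unsupported format: use JPEG, PNG, GIF, or BMP.";
            return false;
        }

        if (new FileInfo(path).Length >= MaxBytes)
        {
            reason = "File is 4 MB or larger; resize or compress the image first.";
            return false;
        }

        return true;
    }
}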

One of the many cool things this API can also do is face detection. Not only can it detect human faces in images and return the face coordinates, gender, and approximate age, it can also detect emotions based on the facial expressions displayed in the image. Using powerful AI algorithms, it assigns a percentage to each human emotion present in a given photo, such as happiness, sadness, fear, anger, disgust, surprise, and contempt.
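Here is a rough sketch of that call over plain HTTP, uploading the selfie as raw bytes and requesting only the emotion attribute. Again, the region in the endpoint is a placeholder and the class name is illustrative.

// Sketch: upload a selfie as raw bytes to the Face API detect endpoint and
// request emotion scores. The "westus" region is a placeholder.
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class EmotionDetector
{
    const string Endpoint =
        "https://westus.api.cognitive.microsoft.com/face/v1.0/detect" +
        "?returnFaceId=false&returnFaceAttributes=emotion";

    public static async Task<string> DetectEmotionsAsync(byte[] imageBytes, string subscriptionKey)
    {
        using (var http = new HttpClient())
        {
            http.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);

            var body = new ByteArrayContent(imageBytes);
            body.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");

            HttpResponseMessage response = await http.PostAsync(Endpoint, body);
            response.EnsureSuccessStatusCode();

            // JSON array with one entry per detected face, each carrying emotion scores.
            return await response.Content.ReadAsStringAsync();
        }
    }
}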

Integrating Cognitive Services into your app means that it can now see, hear and recognize emotions in your interactions with it. Microsoft’s tagline for Cognitive Services is “Give Your Apps a Human Side”, and it does just that.

Getting Started with Cognitive Services

Signing up:

To get started with the Cognitive Services APIs, the first thing you need to do is subscribe to the Vision API service in Azure. Once you have signed in to your Microsoft account and subscribed to the Vision API, you will receive an account key; you will need this key later on to call the Vision API.

Integrating Cognitive Services into the app:

Without going into much implementation detail, here are the main tasks we need to accomplish in order to integrate Cognitive Services into the app:

  • Include the following NuGet packages in the solution:

Microsoft.ProjectOxford.Vision and Microsoft.ProjectOxford.Face

  • To call the Vision API, we simply need a POST request. We can pass either the image URL or the whole image as raw bytes.
  • Initialize the VisionClient instance; we will need to pass in the subscription key we saved (a sketch of this step appears after the JSON response and emotion scores below).
  • The JSON response will look something like this:
[
  {
    "faceId": "6b854442-634e-4d46-9091-00165bf12e3e",
    "faceRectangle": { "top": 128, "left": 459, "width": 224, "height": 224 },
    "faceAttributes": {
      "hair": {
        "bald": 0.0,
        "invisible": false,
        "hairColor": [
          { "color": "brown", "confidence": 1.0 },
          { "color": "blond", "confidence": 0.69 },
          { "color": "black", "confidence": 0.54 },
          { "color": "other", "confidence": 0.31 },
          { "color": "gray", "confidence": 0.05 },
          { "color": "red", "confidence": 0.04 }
        ]
      },
      "smile": 0.639,
      "headPose": { "pitch": 0.0, "roll": -16.9, "yaw": 16.7 },
      "gender": "male",
      "age": 27.4,
      "facialHair": { "moustache": 0.0, "beard": 0.0, "sideburns": 0.0 },
      "glasses": "ReadingGlasses",
      "makeup": { "eyeMakeup": true, "lipMakeup": true },
      "emotion": {
        "anger": 0.015,
        "contempt": 0.001,
        "disgust": 0.037,
        "fear": 0.001,
        "happiness": 0.939,
        "neutral": 0.001,
        "sadness": 0.0,
        "surprise": 0.007
      },
      "occlusion": { "foreheadOccluded": false, "eyeOccluded": false, "mouthOccluded": false },
      "accessories": [
        { "type": "glasses", "confidence": 0.99 }
      ],
      "blur": { "blurLevel": "low", "value": 0.0 },
      "exposure": { "exposureLevel": "goodExposure", "value": 0.48 },
      "noise": { "noiseLevel": "low", "value": 0.0 }
    },
    "faceLandmarks": {
      "pupilLeft": { "x": 504.8, "y": 206.8 },
      "pupilRight": { "x": 602.5, "y": 178.4 },
      "noseTip": { "x": 593.5, "y": 247.3 },
      "mouthLeft": { "x": 529.8, "y": 300.5 },
      "mouthRight": { "x": 626.0, "y": 277.3 },
      "eyebrowLeftOuter": { "x": 461.0, "y": 186.8 },
      "eyebrowLeftInner": { "x": 541.9, "y": 178.9 },
      "eyeLeftOuter": { "x": 490.9, "y": 209.0 },
      "eyeLeftTop": { "x": 509.1, "y": 199.5 },
      "eyeLeftBottom": { "x": 509.3, "y": 213.9 },
      "eyeLeftInner": { "x": 529.0, "y": 205.0 },
      "eyebrowRightInner": { "x": 579.2, "y": 169.2 },
      "eyebrowRightOuter": { "x": 633.0, "y": 136.4 },
      "eyeRightInner": { "x": 590.5, "y": 184.5 },
      "eyeRightTop": { "x": 604.2, "y": 171.5 },
      "eyeRightBottom": { "x": 608.4, "y": 184.0 },
      "eyeRightOuter": { "x": 623.8, "y": 173.7 },
      "noseRootLeft": { "x": 549.8, "y": 200.3 },
      "noseRootRight": { "x": 580.7, "y": 192.3 },
      "noseLeftAlarTop": { "x": 557.2, "y": 234.6 },
      "noseRightAlarTop": { "x": 603.2, "y": 225.1 },
      "noseLeftAlarOutTip": { "x": 545.4, "y": 255.5 },
      "noseRightAlarOutTip": { "x": 615.9, "y": 239.5 },
      "upperLipTop": { "x": 591.1, "y": 278.4 },
      "upperLipBottom": { "x": 593.2, "y": 288.7 },
      "underLipTop": { "x": 597.1, "y": 308.0 },
      "underLipBottom": { "x": 600.3, "y": 324.8 }
    }
  }
]

In this JSON result, the emotion object contains scores for each detected emotion:

"emotion": {
"anger": 0.037,
"contempt": 0.001,
"disgust": 0.015,
"fear": 0.001,
"happiness": 0.939,
"neutral": 0.001,
"sadness": 0.0,
"surprise": 0.007
}
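Tying this back to the integration steps above, here is a minimal sketch of initializing the client with the saved subscription key, detecting the face, and picking out the dominant emotion. It assumes the FaceServiceClient class from the Microsoft.ProjectOxford.Face package listed earlier; the endpoint root and the way the winning emotion is chosen are my own illustrative choices.

// Sketch: initialize the Face client with the subscription key saved earlier,
// detect faces in the selfie stream, and report the strongest emotion.
// Types come from the Microsoft.ProjectOxford.Face package; the endpoint
// root ("westus") is a placeholder.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.ProjectOxford.Face;
using Microsoft.ProjectOxford.Face.Contract;

public static class SelfieEmotionAnalyzer
{
    public static async Task AnalyzeAsync(Stream selfie, string subscriptionKey)
    {
        var client = new FaceServiceClient(
            subscriptionKey,
            "https://westus.api.cognitive.microsoft.com/face/v1.0");   // placeholder region

        Face[] faces = await client.DetectAsync(
            selfie,
            returnFaceId: false,
            returnFaceLandmarks: false,
            returnFaceAttributes: new[] { FaceAttributeType.Emotion });

        if (faces.Length == 0)
        {
            Console.WriteLine("No face detected in the selfie.");
            return;
        }

        var emotion = faces[0].FaceAttributes.Emotion;

        // Flatten the scores so the dominant emotion can be displayed as a percentage.
        var scores = new Dictionary<string, float>
        {
            ["anger"] = emotion.Anger,
            ["contempt"] = emotion.Contempt,
            ["disgust"] = emotion.Disgust,
            ["fear"] = emotion.Fear,
            ["happiness"] = emotion.Happiness,
            ["neutral"] = emotion.Neutral,
            ["sadness"] = emotion.Sadness,
            ["surprise"] = emotion.Surprise
        };

        var dominant = scores.OrderByDescending(s => s.Value).First();
        Console.WriteLine($"Dominant emotion: {dominant.Key} ({dominant.Value:P1})");
    }
}

From here, the percentage scores can be bound to the app’s UI, which is how the per-emotion breakdown described earlier ends up on the screen next to the selfie.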

Conclusion:

Cognitive Services adds intelligent communication to web and mobile experiences, leveraging speech, vision, and text-based technologies. These experiences work to close the gap between human and machine interaction. Closing this gap allows organizations to scale sales and customer service efforts and enhance brand awareness and loyalty, while gleaning some of the most insightful data available today.

It’s nearly impossible to have a conversation about emerging technologies today without mentioning artificial intelligence in some form. AI capabilities that were once considered science fiction have now become a reality through the efforts of companies like Microsoft. The company’s AI mission states they are “democratizing AI for every person and every organization.” Through numerous Microsoft tools, developers are bringing AI to the masses.

Moving beyond just “user friendly”, applications need to be designed with natural forms of language and actions in mind. Instilling artificial intelligence into applications breaks down the barriers of communication to enhance productivity and engagement.
