Microsoft Azure for NLP – Fundamentals of Natural Language Processing
Microsoft Azure for NLP
Microsoft Azure provides a wide variety of NLP-related services for application development. Table 5-1 lists the cognitive services used to build natural language processing solutions in Microsoft Azure.
Table 5-1 List of cognitive services used to build NLP solutions in Microsoft Azure
Service | Capabilities
Language | Language detection, key phrase extraction, entity detection, sentiment analysis, question answering, conversational language understanding
Speech | Text-to-speech, speech-to-text, speech translation
Translator | Text translation
Azure Bot Service | Platform for conversational AI
All of the preceding offerings can be broken down into their core NLP tasks, which will be discussed in the next section.
Core Azure NLP Workloads: Language, Speech, and Translator
Language, speech, and translation are the three main pillars of language processing served by Microsoft Azure NLP services.
Language
The main language capabilities are language detection, key phrase extraction, entity detection, and sentiment analysis.
Language Detection
The Azure Cognitive Service for Language is a cloud-based set of machine learning and AI algorithms for building intelligent applications that work with written language. One of its capabilities is language detection, which identifies the language a text is written in and returns a language code. It covers a wide range of languages, variants, dialects, and regional or cultural languages.
The first step in any text analysis or natural language processing pipeline is to identify the language being used. If the language of a document is detected incorrectly, every language-specific model that follows will give wrong results, as when an English language analyzer is applied to French text, and those errors compound through the pipeline. It is therefore critical to determine the language of each document and whether any sections are in another language. Depending on the country and culture, it is quite common for a document to contain sections in more than one language.
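The dependency of later stages on the detected language can be sketched as a simple routing step. This is a minimal illustration, not part of any Azure API: the analyzer functions, the `route` helper, and the 0.8 confidence threshold are all hypothetical.

```python
from typing import Callable


# Hypothetical language-specific analyzers; real ones would be tokenizers,
# taggers, or sentiment models trained for a single language.
def analyze_english(text: str) -> str:
    return f"[en analyzer] {text}"


def analyze_french(text: str) -> str:
    return f"[fr analyzer] {text}"


ANALYZERS: dict[str, Callable[[str], str]] = {
    "en": analyze_english,
    "fr": analyze_french,
}


def route(text: str, language_code: str, confidence: float,
          threshold: float = 0.8) -> str:
    """Send text to the analyzer for its detected language, or flag it."""
    analyzer = ANALYZERS.get(language_code)
    # An unsupported language or a low-confidence detection is set aside
    # instead of being passed to the wrong analyzer.
    if analyzer is None or confidence < threshold:
        return f"[needs review] {text}"
    return analyzer(text)
```

Flagging uncertain documents rather than guessing keeps a detection mistake from silently propagating into the language-specific stages.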
For documents written in a single language, detection usually relies on statistical profiles of languages. Language identification helps categorize content and improves search results, especially for multilingual documents and for any short text: social media posts, image captions, news headlines, email subject lines, metadata, keywords, queries, files, logs, and more. Commercial vendors such as Basis Technology offer language identification products.
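One common form of statistical profile is a table of character n-gram frequencies. The sketch below compares the trigram profile of an input text with per-language profiles using cosine similarity. It illustrates the general technique only and is not how the Azure service is implemented; the three training sentences are tiny hypothetical samples, whereas real profiles are built from large corpora.

```python
from collections import Counter
from math import sqrt


def trigram_profile(text: str) -> Counter:
    """Count overlapping character trigrams in lowercased, space-padded text."""
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical one-sentence samples standing in for real training corpora.
SAMPLES = {
    "en": "the food was very good and the service was friendly and quick",
    "fr": "la nourriture était très bonne et le service était rapide et sympathique",
    "es": "la comida estaba muy buena y el servicio fue rápido y amable",
}
PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}


def guess_language(text: str) -> str:
    """Return the code of the language whose profile best matches the text."""
    query = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine_similarity(query, PROFILES[lang]))
```

With profiles this small the guess is only reliable for text that resembles the samples, which is also why very short inputs are harder to identify than full documents.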
Use the Language Service’s language detection feature to find out what language a piece of text is written in. You can submit multiple documents for analysis in a single request. For each submitted document, the service returns the following:
- The language name (e.g., “English”).
- The ISO 639-1 language code (e.g., “en”).
- A confidence score for the language detection.
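The batch request and the three returned values can be sketched with plain dictionaries. The JSON field names below (`kind`, `analysisInput`, `detectedLanguage`, `iso6391Name`, `confidenceScore`) reflect my understanding of the Language REST API and should be checked against the current reference. The helper names are hypothetical, and `SAMPLE_RESPONSE` is hand-written for illustration, not captured service output.

```python
from typing import NamedTuple


class DetectedLanguage(NamedTuple):
    doc_id: str
    name: str          # language name, e.g. "English"
    iso6391_name: str  # ISO 639-1 code, e.g. "en"
    confidence: float  # confidence score between 0 and 1


def build_request(texts: list[str]) -> dict:
    """Build a request body that submits several documents at once."""
    return {
        "kind": "LanguageDetection",
        "analysisInput": {
            "documents": [
                {"id": str(i), "text": text}
                for i, text in enumerate(texts, start=1)
            ]
        },
    }


def parse_response(body: dict) -> list[DetectedLanguage]:
    """Extract the name, code, and score reported for each document."""
    results = []
    for doc in body["results"]["documents"]:
        lang = doc["detectedLanguage"]
        results.append(DetectedLanguage(
            doc["id"], lang["name"], lang["iso6391Name"], lang["confidenceScore"]
        ))
    return results


# Hand-written example of the assumed response shape (not real output).
SAMPLE_RESPONSE = {
    "kind": "LanguageDetectionResults",
    "results": {
        "documents": [
            {"id": "1", "detectedLanguage": {
                "name": "English", "iso6391Name": "en", "confidenceScore": 1.0}},
        ],
        "errors": [],
    },
}

for item in parse_response(SAMPLE_RESPONSE):
    print(item.doc_id, item.name, item.iso6391_name, item.confidence)
```

Sending the request also requires the resource endpoint, an API version, and a subscription key, which are omitted here because they are specific to each Azure resource.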
Consider the following scenario: you own a restaurant where guests can fill out questionnaires and comment on the cuisine, service, and staff, among other things. Assume you’ve received the following customer feedback:
- Review 1: “A wonderful lunch spot. The soup was fantastic.”
- Review 2: “Excelente comida y servicio.”
- Review 3: “The croque monsieur avec frites was excellent. Bon appétit!”
You can use the Language Service’s text analytics to find out what language each of these reviews was written in. It might return the results shown in Table 5-2.
Table 5-2 Result of the Language Service’s text analytics
Document | Language Name | ISO 639-1 Code | Score
Review 1 | English | en | 1.0
Review 2 | Spanish | es | 1.0
Review 3 | English | en | 0.9
Although Review 3 contains both English and French, the service reports English for it. Language detection focuses on the predominant language of a text. The service uses an algorithm that weighs factors such as phrase length or the proportion of the text written in each language. The returned language name and code identify the predominant language, and mixed-language content may lower the confidence score below 1.
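The idea of choosing a predominant language by its share of the text can be illustrated with a small sketch. This is not the service's actual algorithm, which is not described here. It assumes the text has already been split into segments labeled with a language code, and the `dominant_language` helper and its inputs are hypothetical.

```python
from collections import defaultdict


def dominant_language(segments: list[tuple[str, str]]) -> tuple[str, float]:
    """Return (language code, share of characters) for the most-used language.

    `segments` is a list of (language_code, text) pairs labeled beforehand.
    """
    chars_per_language: dict[str, int] = defaultdict(int)
    for language, text in segments:
        chars_per_language[language] += len(text)
    total = sum(chars_per_language.values())
    if total == 0:
        raise ValueError("no text to analyze")
    best = max(chars_per_language, key=chars_per_language.get)
    return best, chars_per_language[best] / total


# Hand-labeled example: a longer English sentence followed by a French phrase.
language, share = dominant_language([
    ("en", "The soup was fantastic."),
    ("fr", "Bon appétit!"),
])
print(language, round(share, 2))
```

A share below 1 for the winning language plays a role similar to the reduced confidence score reported for mixed-language input.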