Sathvick Views: C36 How AI Understands Language

Natural Language Processing (NLP)

Dr Sudheendra S G provides a detailed overview of Natural Language Processing (NLP), covering its fundamental concepts, applications, technical components, and ethical considerations.

1. Introduction to NLP

NLP is the field that enables computers to "parse, interpret, and generate natural language." Unlike the precise syntax of programming languages, natural languages are inherently "messy—ambiguous words, accents, missing info." NLP aims to bridge this gap, allowing computers to understand and interact with human language.

Key Learning Goals:

Explain what NLP is and its daily life applications.
Understand the core components of NLP from text processing to speech synthesis.
Discuss limitations and ethical implications (bias, privacy, misuse).

2. Text Processing Fundamentals

2.1 Tokens & Parts of Speech (POS)

Tokenization: The initial step in NLP, where text is split into fundamental units called tokens (words, punctuation, etc.). For example, "The Mongols rose from the leaves." becomes "The | Mongols | rose | from | the | leaves | ."
POS Tagging: Assigns grammatical categories (Noun, Verb, Adjective, etc.) to each token. A single word can take multiple tags: "leaves" is a noun in "the leaves fell" and a verb in "she leaves early." This is why "context matters" for disambiguation.
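
The two steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the regular expression is one simple tokenization choice, and the tiny LEXICON and its tag names are invented for this example rather than taken from any real tagger.

```python
import re

def tokenize(text):
    # A token is a run of word characters (optionally with internal
    # apostrophes) or a single punctuation mark.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

# Hypothetical toy lexicon: every tag each word could take out of context.
LEXICON = {
    "the": {"DET"},
    "mongols": {"NOUN"},
    "rose": {"NOUN", "VERB"},
    "from": {"PREP"},
    "leaves": {"NOUN", "VERB"},
    ".": {"PUNCT"},
}

def candidate_tags(token):
    # Without context we can only list the possibilities, not pick one.
    return sorted(LEXICON.get(token.lower(), {"UNK"}))

tokens = tokenize("The Mongols rose from the leaves.")
print(" | ".join(tokens))
for tok in tokens:
    print(tok, candidate_tags(tok))
```

Words such as "rose" and "leaves" come back with two candidate tags, which is exactly the ambiguity a real tagger has to resolve from the surrounding words.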

2.2 Grammar & Parse Trees

Phrase-Structure Rules (CFGs): These rules encode grammar, such as "S → NP VP" (Sentence becomes Noun Phrase followed by Verb Phrase).
Parsers: Build parse trees that visually "expose sentence structure." These trees are crucial for understanding the grammatical relationships within a sentence. Ambiguous sentences, like "I saw the man with a telescope," can yield "two valid trees": in one the speaker uses the telescope to see, in the other the man has the telescope. This is how "parsing matters" for resolving different meanings.
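
To make the "two valid trees" point concrete, here is a minimal sketch that counts the parse trees of a sentence with a CYK-style chart. The grammar is a small hand-written one assumed for this example (it treats "I" directly as a noun phrase); it is not a grammar from the lesson.

```python
from collections import defaultdict

# Hypothetical toy grammar in binary form: (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # "the man [with a telescope]"
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # "[saw the man] with a telescope"
    ("PP", "P", "NP"),
]
LEXICAL_RULES = {
    "i": ["NP"], "saw": ["V"], "the": ["Det"], "a": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}

def count_parses(words, root="S"):
    n = len(words)
    # chart[(i, j)][symbol] = number of distinct trees covering words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for symbol in LEXICAL_RULES.get(word.lower(), []):
            chart[(i, i + 1)][symbol] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for parent, left, right in BINARY_RULES:
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)][root]

print(count_parses("I saw the man with a telescope".split()))
print(count_parses("I saw the man".split()))
```

The ambiguous sentence yields two trees because the prepositional phrase can attach either to the verb phrase or to the noun phrase, while the shorter sentence has only one.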

3. Understanding Language: Intent, Knowledge Graphs, and Chatbots

3.1 Intent, Entities & Slot Filling

Voice queries and user input often map to a specific intent and associated slots (entities).

Intent: The user's goal (e.g., FIND_PLACE, SET_ALARM).
Slots: Specific pieces of information extracted from the utterance (e.g., {food=pizza, constraint=nearest}, {time=2:20}). These structured outputs "feed search, maps, or Q&A systems."
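
As a rough illustration of intent and slot extraction, the sketch below uses hand-written regular expressions. The patterns, the parse_utterance helper, and the example utterances are assumptions made for this example; production assistants typically learn this mapping from data instead.

```python
import re

# Hypothetical patterns: each intent has one regex whose named groups are the slots.
INTENT_PATTERNS = [
    ("SET_ALARM", re.compile(r"\balarm\b.*?(?P<time>\d{1,2}:\d{2})", re.IGNORECASE)),
    ("FIND_PLACE", re.compile(r"\b(?P<constraint>nearest|closest)\s+(?P<food>\w+)", re.IGNORECASE)),
]

def parse_utterance(text):
    # Return the first matching intent together with its extracted slots.
    for intent, pattern in INTENT_PATTERNS:
        match = pattern.search(text)
        if match:
            slots = {name: value.lower() for name, value in match.groupdict().items()}
            return {"intent": intent, "slots": slots}
    return {"intent": "UNKNOWN", "slots": {}}

print(parse_utterance("Where is the nearest pizza place?"))
print(parse_utterance("Set an alarm for 2:20"))
```

The returned dictionary is the kind of structured output that a search, maps, or Q&A backend could consume.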

3.2 Knowledge Graphs & Natural Language Generation (NLG)

Knowledge Graphs: Store facts as interconnected triples (subject, relation, object). Examples include ("Thriller", sungBy, "Michael Jackson") and ("Thriller", releaseYear, 1983). These graphs represent factual knowledge in a structured format.
NLG (Natural Language Generation): The process of generating human-readable text. Template-based NLG fills predefined templates with values from knowledge graph triples; for example, a template such as "{subject} was released in {year} and sung by {artist}." turns the two "Thriller" triples into one sentence. This contrasts with more advanced "freeform generation."
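
A minimal sketch of the idea, using the two "Thriller" triples from the text. The helper names (facts_about, describe) and the exact template wording are assumptions for illustration only.

```python
# Facts stored as (subject, relation, object) triples.
TRIPLES = [
    ("Thriller", "sungBy", "Michael Jackson"),
    ("Thriller", "releaseYear", 1983),
]

def facts_about(subject, triples=TRIPLES):
    # Collect every relation -> object pair known for one subject.
    return {relation: obj for subj, relation, obj in triples if subj == subject}

# Template-based NLG: fixed wording with slots filled from the graph.
TEMPLATE = "{subject} was released in {year} and sung by {artist}."

def describe(subject):
    facts = facts_about(subject)
    return TEMPLATE.format(
        subject=subject,
        year=facts["releaseYear"],
        artist=facts["sungBy"],
    )

print(describe("Thriller"))
# prints: Thriller was released in 1983 and sung by Michael Jackson.
```

The template approach is predictable but rigid: a subject missing either fact raises a KeyError here, whereas freeform generation would have to decide what to say instead.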

3.3 Chatbots: From Rules to Machine Learning

Rule-based Chatbots: Early chatbots like ELIZA relied on "rules & pattern matching." While "clever," they were "brittle" and easily failed outside their predefined patterns. An example rule: if the input matches "I feel {rest}", reply "Why do you feel {rest}?"
Machine Learning (ML) Chatbots: Modern systems leverage ML to "learn intents from data (supervised ML) and manage dialog state." This approach is more robust and scalable, processing "text → features → classifier → intent → policy decides response." However, challenges remain with nuances like "sarcasm, slang, long context."
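
The rule-based approach can be sketched as follows. This is a single ELIZA-style rule written for illustration, with an assumed fallback reply; it is not ELIZA's actual script.

```python
import re

# Each rule pairs a pattern with a reply template that reuses the captured text.
RULES = [
    (re.compile(r"\bI feel (?P<rest>.+?)[.!?]*$", re.IGNORECASE), "Why do you feel {rest}?"),
]
FALLBACK = "Tell me more."  # hypothetical default when no rule matches

def reply(text):
    for pattern, template in RULES:
        match = pattern.search(text.strip())
        if match:
            return template.format(rest=match.group("rest"))
    return FALLBACK

print(reply("I feel tired today."))   # Why do you feel tired today?
print(reply("I'm feeling tired."))    # Tell me more.
```

The second call shows the brittleness: a small rephrasing falls outside the pattern, which is the gap that learned intent classifiers aim to close.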

3.4 Language Models (n-grams)

Language Models (LMs): Score sequences of words, predicting the likelihood of a word appearing given its preceding context.
N-grams: Simple LMs that consider only a fixed window of preceding words. A bigram model, for example, estimates P(happy | was) from counts of adjacent word pairs in a corpus. These models "resolve ambiguities," helping a recognizer choose between similar-sounding words like "happy" and "harpy" based on probability.
Neural LMs: More advanced models that "capture longer context," leading to improved performance.
Metrics: Perplexity measures LM quality (lower is better), while BLEU scores generated text by its n-gram overlap with reference texts.
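
A bigram model can be sketched with plain counting. The four-sentence corpus below is invented for this example, and the estimate is the unsmoothed maximum-likelihood one, so unseen pairs get probability zero.

```python
from collections import Counter

# Hypothetical tiny corpus, made up for illustration.
corpus = [
    "she was happy",
    "he was happy",
    "he was tired",
    "the harpy was angry",
]

bigram_counts = Counter()   # counts of (previous word, word) pairs
context_counts = Counter()  # counts of each word appearing as "previous word"
for sentence in corpus:
    words = sentence.split()
    bigram_counts.update(zip(words, words[1:]))
    context_counts.update(words[:-1])

def bigram_prob(word, prev):
    # P(word | prev) = count(prev, word) / count(prev as context)
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

print(bigram_prob("happy", "was"))  # 0.5
print(bigram_prob("harpy", "was"))  # 0.0
```

Given the context "was", the model prefers "happy" over "harpy", which is how a language model can break ties between acoustically similar candidates.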

4. Speech Technologies

4.1 Speech Recognition

Spectrograms: Audio waveforms are transformed into spectrograms (using the FFT), which plot time on one axis and frequency on the other, with brightness showing energy. Different vowels (e.g., "aaaa" vs. "eeee") show distinct bands of concentrated energy called formants.
Phonemes: Speech recognizers detect these fundamental units of sound (approximately 44 in English) and combine them with a language model to convert speech into text.
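WER (Word Error Rate): The primary metric for evaluating speech recognition accuracy, computed as the number of word substitutions, deletions, and insertions divided by the number of words in the reference transcript. One challenge for recognizers is "coarticulation (sounds blend)."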
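
WER is conventionally computed from a word-level edit distance between the reference transcript and the recognizer's output. The sketch below assumes simple whitespace tokenization and no text normalization, which real evaluations usually add.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("she was happy", "she was harpy"))
print(word_error_rate("set an alarm", "set alarm"))
```

Both calls have one error against three reference words (a substitution and a deletion respectively). Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.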

4.2 Speech Synthesis (Text-to-Speech - TTS)

Concatenative TTS (Older): "Stitched recorded phonemes" together, often resulting in "robotic prosody."
Neural TTS (Modern): "Produces natural rhythm/intonation" using advanced techniques (e.g., sequence-to-mel + vocoder). Despite significant improvements, challenges persist in synthesizing "emotion, style control, names."
Pipeline: Text → G2P (grapheme-to-phoneme) → Prosody → Mel spectrogram → Vocoder → Audio.

5. Ethics & Limitations

NLP, while powerful, presents several ethical challenges and inherent limitations:

5.1 Ethical Risks

Bias: Can arise from "datasets, dialects," leading to unfair or inaccurate outcomes for certain groups (e.g., résumé screeners).
Privacy: Concerns about "always-listening mics" in voice assistants and the collection of personal data.
Misuse: Potential for "impersonation, disinfo" through advanced speech synthesis and text generation.
Consent: Importance of obtaining explicit consent for recordings and data usage.

5.2 Mitigation Strategies

Representative Data: Using diverse and balanced datasets to reduce bias.
Audits: Regularly checking NLP systems for fairness and accuracy across different demographics.
On-device Processing: Performing computations locally to enhance privacy.
Opt-in & Clear Retention: Ensuring users consent to data collection and are informed about data retention policies.
Human-in-the-Loop: Incorporating human oversight to catch errors and ethical issues.

5.3 Common Misconceptions & Limitations

"Parsing = understanding": While parsing aids understanding, "meaning needs context & world knowledge."
"Just add more rules": Rule-based systems are "brittle"; "data-driven models scale better."
"Accuracy is enough": It's crucial to "track fairness across dialects/accents; for ASR use WER by group."
Overpromising: NLP is powerful but "not omniscient; ambiguity and pragmatics remain hard."
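
The "WER by group" advice can be sketched as a simple aggregation. The group labels and numbers below are invented purely to show the calculation; they are not measurements from any real system.

```python
from collections import defaultdict

# Hypothetical per-utterance results: (speaker group, word errors, reference words).
results = [
    ("accent_a", 1, 20),
    ("accent_a", 2, 30),
    ("accent_b", 6, 25),
    ("accent_b", 4, 25),
]

def wer_by_group(rows):
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, n_errors, n_words in rows:
        errors[group] += n_errors
        words[group] += n_words
    # Pool errors and words per group before dividing, so long utterances
    # are weighted by their length.
    return {group: errors[group] / words[group] for group in words}

overall = sum(e for _, e, _ in results) / sum(w for _, _, w in results)
print("overall:", overall)
for group, wer in wer_by_group(results).items():
    print(group, wer)
```

In this made-up data the overall WER of 0.13 hides a gap between 0.06 for one group and 0.2 for the other, which is the kind of disparity a single accuracy figure would not reveal.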

6. Conclusion

"NLP turns words → structure → meaning → action—from POS & parse trees to intents, language models, and speech—powerful tools that demand careful, ethical use." This field continues to evolve rapidly, transforming how humans interact with technology, but its development must be guided by a strong awareness of its societal impact and inherent limitations.

Sunday, August 24, 2025
