Unit 3 | Natural Language Processing Notes | AKTU Notes



    UNIT III: Grammars and Parsing

    3.1 What is a Grammar?

    A grammar is a set of rules that defines which sentences are valid (correct) in a language.

    In NLP, grammars help computers understand the structure of sentences.

    3.2 Grammars and Sentence Structure

    The most commonly used grammar formalism in NLP is the Context-Free Grammar (CFG).

    CFG consists of:

    • Terminal symbols: Actual words — "dog", "runs", "the"
    • Non-terminal symbols: Grammatical categories — S, NP, VP, N, V, Det
    • Production rules: Rules that expand non-terminals
    • Start symbol: Usually S (Sentence)

    Example CFG:

    • S → NP VP
    • NP → Det N
    • VP → V NP
    • Det → "the" | "a"
    • N → "dog" | "cat" | "bone"
    • V → "chased" | "ate"

    Parsing Example: "The dog chased a cat"

    • S
    • NP: [Det: the] [N: dog]
    • VP: [V: chased] [NP: [Det: a] [N: cat]]

    This tree is called a Parse Tree.
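
    This toy grammar can be written out and parsed with NLTK. The Python sketch below is one possible encoding (nltk.CFG.fromstring and nltk.ChartParser are standard NLTK classes; the input is lowercased to match the lowercase terminals above):

import nltk

# The Section 3.2 toy grammar in NLTK's CFG notation.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'dog' | 'cat' | 'bone'
V -> 'chased' | 'ate'
""")

# Any NLTK parser can recover the tree; a chart parser is used here.
parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased a cat".split()):
    print(tree)
# (S (NP (Det the) (N dog)) (VP (V chased) (NP (Det a) (N cat))))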

    3.3 Top-Down Parsing

    In Top-Down Parsing, we start from the Start Symbol (S) and try to expand rules until we match the input sentence.

    Process:

    • Start with S.
    • Apply grammar rules to expand S.
    • Keep expanding until you reach terminal words.
    • Check if they match the input.

    Example: Input: "The dog runs" (this also assumes two extra rules, VP → V and V → "runs", beyond the toy grammar in 3.2)

    • Start: S
    • Expand: S → NP VP
    • Expand: NP → Det N → "The" "dog"
    • Expand: VP → V → "runs"
    • Match! ✓

    Advantage: Simple to understand and implement.

    Disadvantage:

    • Can get into infinite loops if the grammar has left-recursive rules (e.g., NP → NP PP).
    • May try many wrong paths before finding the right one (backtracking).
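
    The Python sketch below shows the idea of top-down parsing with backtracking. The GRAMMAR dictionary and the expand() and parse() helpers are illustrative names, and the grammar is extended with VP → V and V → "runs", as assumed in the example above:

# Minimal top-down (recursive-descent) recognizer with backtracking.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"], ["V"]],
    "Det": [["the"], ["a"]],
    "N":   [["dog"], ["cat"], ["bone"]],
    "V":   [["chased"], ["ate"], ["runs"]],
}

def expand(symbol, tokens, pos):
    """Try to derive tokens starting at `pos` from `symbol`; yield every end position."""
    if symbol not in GRAMMAR:                      # terminal: must match the next word
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    for rhs in GRAMMAR[symbol]:                    # try each production (backtracking)
        positions = [pos]
        for part in rhs:
            positions = [end for p in positions for end in expand(part, tokens, p)]
        yield from positions

def parse(sentence):
    tokens = sentence.lower().split()
    return any(end == len(tokens) for end in expand("S", tokens, 0))

print(parse("The dog runs"))          # True
print(parse("The dog chased a cat"))  # True
print(parse("Dog the runs"))          # False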

    3.4 Bottom-Up Parsing

    In Bottom-Up Parsing, we start from the input words and try to combine them using grammar rules until we reach S.

    Process:

    • Start with input words (terminals).
    • Apply grammar rules in reverse to combine them into larger structures.
    • Continue until you get S.

    Example: Input: "The dog runs" (using the same extra rules, VP → V and V → "runs", as in 3.3)

    • "The" → Det
    • "dog" → N
    • Det + N → NP
    • "runs" → V → VP
    • NP + VP → S ✓

    Advantage:

    • Works well with ambiguous grammars.
    • Often more efficient, because every structure it builds is anchored in the actual input words.

    Disadvantage:

    • Can generate many irrelevant structures before finding the correct one.
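
    A minimal shift-reduce recognizer illustrates the bottom-up idea, reusing the Section 3.2 toy grammar. The RULES list, the shift_reduce() helper and its reduce-before-shift strategy are illustrative choices, not a standard algorithm from the notes:

# Bottom-up (shift-reduce) recognition: shift words, reduce when the top of the
# stack matches the right-hand side of a rule, stop when only S remains.
RULES = [
    ("S",   ["NP", "VP"]),
    ("NP",  ["Det", "N"]),
    ("VP",  ["V", "NP"]),
    ("Det", ["the"]), ("Det", ["a"]),
    ("N",   ["dog"]), ("N",   ["cat"]), ("N", ["bone"]),
    ("V",   ["chased"]), ("V", ["ate"]),
]

def shift_reduce(sentence):
    stack, words = [], sentence.lower().split()
    while stack != ["S"] or words:
        for lhs, rhs in RULES:
            if len(stack) >= len(rhs) and stack[-len(rhs):] == rhs:
                stack[-len(rhs):] = [lhs]     # REDUCE: apply a rule in reverse
                break
        else:
            if not words:
                return False                  # nothing to shift or reduce: fail
            stack.append(words.pop(0))        # SHIFT the next word onto the stack
    return True

print(shift_reduce("The dog chased a cat"))   # True
print(shift_reduce("Dog the chased"))         # False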

    3.5 Transition Network Grammars (TNG)

    A Transition Network is a type of finite state machine (like a flowchart) used to represent grammar rules.

    How it works:

    • The network has states (nodes) and transitions (arcs).
    • Each arc is labeled with a word category (N, V, Det, etc.).
    • To parse a sentence, you follow the arcs matching the input words.
    • If you reach a final state, the sentence is valid.

    Example (simple sentence):

    • START → [Det] → [N] → [V] → END
    • For "The dog runs": Start → "The" matches Det → "dog" matches N → "runs" matches V → END ✓

    Limitation: Simple transition networks cannot handle complex sentences with embedded clauses.

    3.6 Recursive Transition Networks (RTN)

    RTN extends TNG by allowing recursive calls — one network can call another network.

    This allows handling of complex, nested structures like:

    • "The dog that chased the cat ran away."

    Here, "that chased the cat" is a nested clause. RTN can handle this by recursively calling the sentence network inside the noun phrase network.

    3.7 Top-Down Chart Parsing

    Chart Parsing is a more efficient parsing technique that avoids redoing the same work by storing intermediate results.

    Key Concept — The Chart:

    • A chart is a data structure that records all partial parse results.
    • Each entry (called an edge) records: which rule is being applied, how much has been matched, and the position in the sentence.

    Two types of edges:

    • Active Edge: A rule that has been partially matched (waiting for more).
    • Inactive Edge: A rule that has been fully matched.

    Algorithm:

    • Initialize chart with input words.
    • Apply grammar rules.
    • When a rule is complete, add it as an inactive edge.
    • Use inactive edges to complete active edges.
    • Continue until S is found spanning the whole sentence.

    Advantage: Very efficient — the chart stores each partial result once, so the same computation is never repeated (dynamic programming / memoization).
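
    The sketch below makes the active/inactive-edge idea concrete with a small Earley-style recognizer. The Edge tuple, the GRAMMAR dictionary (the 3.2 grammar plus VP → V and V → "runs") and the recognize() helper are illustrative, not a standard API:

from collections import namedtuple

GRAMMAR = {
    "S":   [("NP", "VP")],
    "NP":  [("Det", "N")],
    "VP":  [("V", "NP"), ("V",)],
    "Det": [("the",), ("a",)],
    "N":   [("dog",), ("cat",), ("bone",)],
    "V":   [("chased",), ("ate",), ("runs",)],
}

# dot < len(rhs): ACTIVE edge (still waiting); dot == len(rhs): INACTIVE (complete) edge.
Edge = namedtuple("Edge", "lhs rhs dot start")

def recognize(sentence):
    words = sentence.lower().split()
    chart = [set() for _ in range(len(words) + 1)]
    chart[0].update(Edge("S", rhs, 0, 0) for rhs in GRAMMAR["S"])   # top-down: predict S
    for i in range(len(words) + 1):
        added = True
        while added:                             # repeat until no new edge appears at i
            added = False
            for edge in list(chart[i]):
                if edge.dot < len(edge.rhs) and edge.rhs[edge.dot] in GRAMMAR:
                    # PREDICT: an active edge expects a non-terminal, so add its rules
                    for rhs in GRAMMAR[edge.rhs[edge.dot]]:
                        new = Edge(edge.rhs[edge.dot], rhs, 0, i)
                        if new not in chart[i]:
                            chart[i].add(new); added = True
                elif edge.dot == len(edge.rhs):
                    # COMPLETE: an inactive edge extends every active edge waiting for it
                    for old in list(chart[edge.start]):
                        if old.dot < len(old.rhs) and old.rhs[old.dot] == edge.lhs:
                            new = Edge(old.lhs, old.rhs, old.dot + 1, old.start)
                            if new not in chart[i]:
                                chart[i].add(new); added = True
        if i < len(words):
            # SCAN: the next word extends edges that are waiting for that terminal
            for edge in chart[i]:
                if edge.dot < len(edge.rhs) and edge.rhs[edge.dot] == words[i]:
                    chart[i + 1].add(Edge(edge.lhs, edge.rhs, edge.dot + 1, edge.start))
    return any(e.lhs == "S" and e.dot == len(e.rhs) and e.start == 0
               for e in chart[len(words)])

print(recognize("The dog runs"))           # True
print(recognize("The dog chased a cat"))   # True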

    3.8 Feature Systems and Augmented Grammars

    Simple CFGs cannot conveniently handle all aspects of language, such as agreement between subject and verb.

    Problem: CFG cannot easily prevent "The dog run" (should be "runs").

    Solution: Use Feature Systems — attach features (like number, person, tense) to grammar symbols.

    Example Features:

    • Number: singular / plural
    • Person: 1st / 2nd / 3rd
    • Tense: past / present / future
    • Gender: masculine / feminine

    Feature Agreement:

    • Subject and verb must agree in number and person.
    • "The dog runs" (singular) ✓
    • "The dogs run" (plural) ✓
    • "The dog run" ✗ (mismatch — caught by feature checking)

    Augmented Grammar:

    • A grammar where each rule has additional feature constraints.
    • Only rules where features match are applied.

    3.9 Basic Feature System for English

    For English, the basic features needed are:

    Feature         Values
    Number (NUM)    singular (sg), plural (pl)
    Person (PER)    1st, 2nd, 3rd
    Tense (TNS)     past, present, future
    Case            nominative, accusative

    Example rule with features:

    • S → NP[NUM=?n] VP[NUM=?n]
    • This means: NP and VP must have the same number (?n is a variable).
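
    NLTK's feature-grammar notation uses exactly this NUM=?n style. The sketch below assumes NLTK's feature-grammar support (FeatureGrammar and FeatureChartParser); the tiny lexicon is illustrative:

from nltk.grammar import FeatureGrammar
from nltk.parse import FeatureChartParser

fg = FeatureGrammar.fromstring("""
% start S
S -> NP[NUM=?n] VP[NUM=?n]
NP[NUM=?n] -> Det N[NUM=?n]
VP[NUM=?n] -> V[NUM=?n]
Det -> 'the'
N[NUM=sg] -> 'dog'
N[NUM=pl] -> 'dogs'
V[NUM=sg] -> 'runs'
V[NUM=pl] -> 'run'
""")

parser = FeatureChartParser(fg)
print(len(list(parser.parse("the dog runs".split()))))   # 1 parse  -> agreement holds
print(len(list(parser.parse("the dog run".split()))))    # 0 parses -> NUM mismatch rejected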

    3.10 Morphological Analysis and the Lexicon

    Morphology is the study of word structure — how words are formed from smaller units called morphemes.

    Types of Morphemes:

    • Root/Stem: The base meaning — "play", "run", "happy"
    • Prefix: Added before root — "un" in "unhappy"
    • Suffix: Added after root — "ness" in "happiness", "ed" in "played"
    • Infix: Added inside a word (rare in English)

    Morphological Analysis in NLP involves:

    • Breaking a word into its morphemes.
    • Identifying the root form (this is called lemmatization).
    • Identifying grammatical features.

    Examples:

    • "running" → run + ing (verb, present participle)
    • "dogs" → dog + s (noun, plural)
    • "unhappiness" → un + happy + ness (adjective with prefix and suffix)

    The Lexicon:

    • A lexicon is essentially a dictionary used by an NLP system.
    • It stores: word forms, root form, part of speech, features, meaning.
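
    A toy morphological analyser can be sketched with a tiny hand-written lexicon plus a few prefix/suffix rules. All entries, rules and the analyze() helper below are illustrative assumptions, not a standard NLP API:

LEXICON = {"dog": "noun", "cat": "noun", "run": "verb", "play": "verb", "happy": "adjective"}

SUFFIX_RULES = [                       # (suffix, grammatical feature it signals)
    ("ing", {"form": "present participle"}),
    ("ed",  {"tense": "past"}),
    ("s",   {"number": "plural"}),
]
PREFIX_RULES = [("un", {"negated": True})]

def analyze(word):
    """Return (root, part of speech, features), or None if the word is unknown."""
    word, features = word.lower(), {}
    for prefix, feats in PREFIX_RULES:               # strip a known prefix, e.g. "un-"
        if word.startswith(prefix):
            word = word[len(prefix):]
            features.update(feats)
    if word in LEXICON:                              # bare root form
        return word, LEXICON[word], features
    for suffix, feats in SUFFIX_RULES:               # strip a known suffix
        if not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)]
        candidates = [stem]
        if len(stem) >= 2 and stem[-1] == stem[-2]:  # undo doubled consonant: "runn" -> "run"
            candidates.append(stem[:-1])
        for root in candidates:
            if root in LEXICON:
                features.update(feats)
                return root, LEXICON[root], features
    return None

print(analyze("running"))   # ('run', 'verb', {'form': 'present participle'})
print(analyze("dogs"))      # ('dog', 'noun', {'number': 'plural'})
print(analyze("unhappy"))   # ('happy', 'adjective', {'negated': True})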

    3.11 Parsing with Features

    When parsing with features, the parser:

    • Looks up each word in the lexicon to get its features.
    • Applies grammar rules only when features match.
    • Builds a parse tree that respects both structure and feature constraints.

    Example: Input: "She runs fast."

    • "She" → Pronoun [NUM=sg, PER=3rd, CASE=nominative]
    • "runs" → Verb [NUM=sg, PER=3rd, TNS=present]
    • NUM and PER match → Valid! ✓
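
    A minimal sketch of this feature check in Python, assuming a tiny hand-written lexicon and an agree() helper (both illustrative):

LEXICON = {
    "she":  {"cat": "Pronoun", "NUM": "sg", "PER": "3rd", "CASE": "nominative"},
    "they": {"cat": "Pronoun", "NUM": "pl", "PER": "3rd", "CASE": "nominative"},
    "runs": {"cat": "Verb",    "NUM": "sg", "PER": "3rd", "TNS": "present"},
    "run":  {"cat": "Verb",    "NUM": "pl", "PER": "3rd", "TNS": "present"},
}

def agree(subject, verb, features=("NUM", "PER")):
    """Apply S -> NP VP only when subject and verb share NUM and PER."""
    subj, vb = LEXICON[subject.lower()], LEXICON[verb.lower()]
    return all(subj[f] == vb[f] for f in features)

print(agree("She", "runs"))    # True  -> "She runs" is accepted
print(agree("She", "run"))     # False -> the rule is blocked, parse fails
print(agree("They", "run"))    # True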

    3.12 Augmented Transition Networks (ATN)

    An ATN is an advanced version of RTN that includes feature checks and actions on each arc.

    Features of ATN:

    • Each arc can test conditions (features) before being followed.
    • Arcs can have actions — storing information in registers (like subject, object, verb).
    • This allows complex grammatical structures to be parsed.

    Example registers: SUBJECT, VERB, OBJECT, MODIFIER

    When parsing "The cat chased the dog":

    • SUBJECT = "The cat"
    • VERB = "chased"
    • OBJECT = "the dog"

    ATNs are powerful enough to handle most English sentence patterns.
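
    The sketch below imitates this register-filling behaviour for the simple pattern above. The arc layout, the CATEGORY lexicon and the register names are illustrative assumptions:

CATEGORY = {"the": "Det", "cat": "N", "dog": "N", "chased": "V"}

def parse_np(tokens, pos):
    """NP sub-network: a Det followed by an N; return (phrase, new position) or None."""
    if (pos + 1 < len(tokens)
            and CATEGORY.get(tokens[pos]) == "Det"
            and CATEGORY.get(tokens[pos + 1]) == "N"):
        return " ".join(tokens[pos:pos + 2]), pos + 2
    return None

def parse_sentence(sentence):
    tokens = sentence.lower().split()
    registers = {}
    np = parse_np(tokens, 0)                      # arc 1: call the NP network
    if not np:
        return None
    registers["SUBJECT"], pos = np                # action: fill the SUBJECT register
    if pos >= len(tokens) or CATEGORY.get(tokens[pos]) != "V":
        return None                               # arc 2: test that the next word is a verb
    registers["VERB"] = tokens[pos]               # action: fill the VERB register
    np = parse_np(tokens, pos + 1)                # arc 3: call the NP network again
    if np:
        registers["OBJECT"], _ = np               # action: fill the OBJECT register
    return registers

print(parse_sentence("The cat chased the dog"))
# {'SUBJECT': 'the cat', 'VERB': 'chased', 'OBJECT': 'the dog'}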
