CSc 453                                        University of Arizona
========================================================
Parsing with Derivatives
========================================================
10/13/16

=============
Some of the notes from last time.

Parsing with Derivatives
  - Punchline: Provide a grammar, ambiguous or not, as data. Provide an
    input, and the output of a PWD parser is the forest of possible
    parse trees.

  - Let’s start with recognizers. A recognizer takes a grammar and an
    input and indicates whether the input belongs to the language.

  - First we need a test suite: inputs that are in and not in each language.
    - Grammar A, Start = Stm: Stm -> print NUM
    - Grammar B, Start = Stm: Stm -> print NUM | read ID
    - Grammar C, Start = SL:  SL -> Stm SL | epsilon, Stm as above

  - Recognizing with Derivatives, python-like pseudocode

        G[0] = grammar for language
        input = list of tokens
        count = 1
        for each token in the input:
            G[count] = Deriv( G[count-1], token )
            count = count + 1

        // count is now one past the last grammar computed
        if nullable(G[count-1]) then accept else reject

================
New Notes for today.

---------------
Deriv Examples: Taking the derivative of a grammar wrt a token.
  - Input: grammar, token
  - Output: new grammar describing what remains of each string in the
    given grammar's language that starts with the given token, after
    that token is removed from the front
  - Examples
      D_print ( A ) = ?
      D_print ( B ) = ?
      D_NUM ( A ) = ?
      nullable( B ) = ?
      nullable( C ) = ?
      D_read( C ) = ?

---------------
Deriv Algorithm

-------------
Notation
  empty (empty set) denotes the empty language, IOW the language with no strings
  epsilon denotes the empty string
  x and t are variables representing tokens
  A represents a non-terminal
  L represents a grammar expression. L can be the empty set, epsilon,
    a token x, a non-terminal A, L1 | L2, or L1 L2.
  A -> L denotes a production.
  G denotes a grammar. A grammar is a set of productions.

  nullable( G ): a grammar is nullable if it can produce the empty string
    - Not fully defining this yet. We will make it more concrete later.
    - We already intuitively know how to compute it because we have been
      computing nullable for predictive parsing tables.

--------------------
Derivative of something with respect to token t.

  D_t(empty)   = empty
  D_t(epsilon) = empty
  D_t(A)       = D_t(A)   -- not recursing on non-terms to avoid infinite recursion
  D_t(x)       = if x==t then epsilon else empty
  D_t(L1 | L2) = D_t(L1) | D_t(L2)
  D_t(L1 L2)   = D_t(L1) L2 | Delta(L1) D_t(L2)
      Delta(L1) = if nullable(L1) then epsilon else empty
  D_t( Delta( L ) ) = empty
  D_t( A -> L ) = D_t(A) -> D_t(L)

  D_t( G ) = { D_t(A->L) | A->L in G }
             union
             { A->L | A->L in G and A in RHS of above set
                      { D_t(A->L) | A->L in G } }

------------------
Recognizer Example

Some questions we want to answer
  - Apply this algorithm to the Grammar A, B, and C examples.
  - What happens when we have epsilon concatenated with L?
  - What happens when we have the empty set concatenated with L?
  - What about recursion in Grammar C?

G = { SL -> print NUM SL | epsilon }

input: print 42
----------------
D_print( G )
  = { D_print( SL -> print NUM SL | epsilon ) }
  = { D_print( SL ) -> D_print( print NUM SL | epsilon ) }
  = { D_print( SL ) -> D_print( print NUM SL ) | D_print(epsilon) }
  // Note recursion of D_print(SL).
  = { D_print(SL) -> D_print(print) NUM SL
                     | Delta(print) D_print(NUM SL)
                     | empty }
  = { D_print(SL) -> epsilon NUM SL
                     | Delta(print) D_print(NUM) SL
                     | Delta(print) Delta(NUM) D_print(SL)
                     | empty }
  = { D_print(SL) -> epsilon NUM SL
                     | empty empty SL
                     | empty empty D_print(SL)
                     | empty }
    union
    { SL -> print NUM SL | epsilon }

---------------------------------------
BEFORE parsing NUM(42), let's clean things up a bit.

-----------------
Empty Set Cleanup
  alternation:
      L | empty   simplifies to L
      empty | L   simplifies to L
  concatenation:
      L empty     simplifies to empty
      empty L     simplifies to empty

--------------------
D_print( G ) = { D_print(SL) -> epsilon NUM SL,
                 SL -> print NUM SL | epsilon }

---------------------------------------
input: print . NUM(42)

Now let's parse NUM(42).
D_NUM( D_print( G ) )
  = { D_NUM( D_print(SL) -> epsilon NUM SL ),
      D_NUM( SL -> print NUM SL | epsilon ) }
  = { D_NUM( D_print(SL) ) -> D_NUM( epsilon NUM SL ),
      D_NUM(SL) -> D_NUM( print NUM SL | epsilon ) }

--------------
BEFORE finishing. Let's introduce a notational shortcut.

  D_print( X )   = X_1
  D_print( X_0 ) = X_1
  D_NUM( D_print( X_0 ) ) = X_2
  X is a non-terminal or grammar, and X_0 is another name for X.
  D_NUM( X ) = D_NUM( X ), no shortcut since NUM alone is not the full
    sequence of tokens seen so far

  More generally ...
  D_tok_i ( X_{i-1} ) = X_i

------------------
Back to computing ...

G_2 = D_NUM( D_print( G ) )
  // copied from above
  = { D_NUM( D_print(SL) ) -> D_NUM( epsilon NUM SL ),
      D_NUM(SL) -> D_NUM( print NUM SL | epsilon ) }
  // rewritten using notation shortcut
  = { SL_2 -> D_NUM( epsilon NUM SL ),
      D_NUM(SL) -> D_NUM( print NUM SL | epsilon ) }
  // continuing computation
  = { SL_2 -> D_NUM( epsilon ) NUM SL | Delta( epsilon ) D_NUM(NUM SL),
      D_NUM(SL) -> D_NUM( print ) NUM SL
                   | Delta( print ) D_NUM( NUM SL )
                   | D_NUM( epsilon ) }

  ... fill in some of the skipped steps on your own ...

G_2 = { SL_2 -> epsilon epsilon SL }
      union
      { SL -> print NUM SL | epsilon }

Do we accept the input? Is G_2 nullable?

------------------
What is the parse tree?

------------------
Another input: epsilon

------------------------
mstrout@cs.arizona.edu, 10/13/16
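
------------------
Sketch: the recognizer in Python

The derivative rules, the empty set cleanup, and the recognizer loop
from these notes can be tried out with a small Python sketch. This is
one possible encoding, not the course's reference implementation, and
several choices are assumptions made for illustration: grammar
expressions are tuples, the names tok, nt, alt, cat, nullable_set,
reachable, deriv, and recognize are hypothetical, D_t(A) is named by the
tuple ("D", t, A), Delta(L1) is evaluated right away instead of being
kept as a node, tokens are matched by kind (so "print 42" is the list
["print", "NUM"]), and productions are pruned by reachability from the
new start symbol rather than by the RHS rule given for D_t( G ).

```python
# Sketch only: the tuple encoding and all function names are assumptions.
EMPTY, EPS = ("empty",), ("eps",)

def tok(x): return ("tok", x)      # a token, matched by kind (print, NUM, ...)
def nt(a): return ("nt", a)        # a reference to non-terminal a

def alt(l1, l2):
    # Empty-set cleanup: L | empty -> L, empty | L -> L
    if l1 == EMPTY: return l2
    if l2 == EMPTY: return l1
    return ("alt", l1, l2)

def cat(l1, l2):
    # Empty-set cleanup: L empty -> empty, empty L -> empty
    if EMPTY in (l1, l2): return EMPTY
    return ("cat", l1, l2)

def nullable_set(g):
    """Return (set of nullable non-terminals, nullable test for expressions)."""
    known = set()
    def nul(e):
        kind = e[0]
        if kind == "eps": return True
        if kind == "nt": return e[1] in known
        if kind == "alt": return nul(e[1]) or nul(e[2])
        if kind == "cat": return nul(e[1]) and nul(e[2])
        return False                       # empty set or a token
    changed = True
    while changed:                         # iterate to a fixed point
        changed = False
        for a, rhs in g.items():
            if a not in known and nul(rhs):
                known.add(a)
                changed = True
    return known, nul

def reachable(g, start):
    """Keep only productions reachable from start (an assumed pruning step)."""
    seen, todo = set(), [start]
    while todo:
        a = todo.pop()
        if a in seen or a not in g:
            continue
        seen.add(a)
        stack = [g[a]]
        while stack:
            e = stack.pop()
            if e[0] == "nt": todo.append(e[1])
            elif e[0] in ("alt", "cat"): stack += [e[1], e[2]]
    return {a: g[a] for a in seen}

def deriv(g, start, t):
    """D_t(G): returns the new grammar and its start symbol D_t(start)."""
    _, nul = nullable_set(g)
    def d(e):
        kind = e[0]
        if kind == "tok": return EPS if e[1] == t else EMPTY
        if kind == "nt": return nt(("D", t, e[1]))   # name D_t(A), no recursion
        if kind == "alt": return alt(d(e[1]), d(e[2]))
        if kind == "cat":
            delta = EPS if nul(e[1]) else EMPTY      # Delta(L1), evaluated eagerly
            return alt(cat(d(e[1]), e[2]), cat(delta, d(e[2])))
        return EMPTY                                 # D_t(empty), D_t(epsilon)
    new_g = dict(g)
    for a, rhs in g.items():
        new_g[("D", t, a)] = d(rhs)
    new_start = ("D", t, start)
    return reachable(new_g, new_start), new_start

def recognize(g, start, tokens):
    """Take one derivative per token, then accept iff the start is nullable."""
    for t in tokens:
        g, start = deriv(g, start, t)
    return start in nullable_set(g)[0]

# Grammars A, B, and C from the test suite in these notes.
STM = alt(cat(tok("print"), tok("NUM")), cat(tok("read"), tok("ID")))
A = {"Stm": cat(tok("print"), tok("NUM"))}
B = {"Stm": STM}
C = {"SL": alt(cat(nt("Stm"), nt("SL")), EPS), "Stm": STM}
```

With this encoding, recognize(C, "SL", ["print", "NUM", "read", "ID"])
accepts and recognize(A, "Stm", ["print"]) rejects. Only the empty set
cleanup is applied, so epsilon concatenations such as
"epsilon epsilon SL" stay in the grammar, as they do in the hand
derivation of G_2.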