To Index

 Documented in Volume 1 of the UNIX Programmers Manual.
 See /usr/doc/lex on 4.2 BSD systems for additional online documentation on this.

 From martin@noscvax.UUCP (Douglas W. Martin)    arpanet: martin@nosc
 Subject: lex(1) output statistics (Summary)
     ... I requested info explaining lex output statistics; e.g. the meanings
 of the "%" parameters, what sorts of rules caused which tables to increase
 in size, etc...  A summary of responses follows.

 	From: (Vern Paxson [rtsg]) vern@lbl-csam

 >I can tell you what some of the mysterious %{e,p,n,...}'s in the Lex
 >output are:
 >
 >    %e is the number of NFA states needed to recognize the input patterns.
 >       It is a linear function of the number of characters and operators in
 >       the patterns.
 >
 >    %n is the number of DFA states needed for the recognizer.  It is
 >       potentially an exponential function of the number of NFA states,
 >       but in practice tends to be no greater and usually less than the
 >       %e figure.
 >
 >    %t is the number of state transitions in the NFA.  It is a function of
 >       the complexity of your patterns (especially "*", and "+" operators)
 >       and especially the occurence of character classes inside of closures.
 >       One character class and closure is enough to do it.  For example,
 >       "keyword recognizer" inputs like:

 	   begin	return( TOK_BEGIN );
 	   end		return( TOK_END );
 	   procedure	return( TOK_PROCEDURE );
 	   .
 	   .
 	   .
 	   [a-z]+	return( TOK_IDENTIFIER );

 >       will result in a very high %t figure.  It's Lex's job to compress
 >       all of these transitions into something much smaller, but it doesn't
 >       do an especially good job.  We have written our own Lex program in
 >       Ratfor and our experience shows that you can halve the size of the
 >       tables for complicated sets of rules.
 >
 >    %o is basically the compressed version of %t.  The total amount of
 >       memory the tables generated by Lex will take is
 >           3*(%n) + 2*(%o) + X  integer words
 >       where X is roughly 2*(%n).  Most of this stuff could be declared
 >       as short but isn't.  In addition, there's a constant overhead of
 >       two arrays of 128 characters each.
 >
 >    I could make guesses as to what %k, %p, and %a are, but all they'd be
 >is guesses.
 >    If your concern is in staying within Lex's default maximums for these
 >values, the way to do it is to eliminate ambiguities and similarities between
 >rules.  The above example of rules that recognize keywords and identifiers
 >will use a great deal less of Lex's resources (and have much smaller output
 >tables) if you get rid of either the keyword rules or the identifier rules.
 >The complexity in the set of rules arises from the ambiguity of each
 >keyword also matching the identifier rule.
 >    But if you want to keep your input rules intact, you can increase Lex's
 >maximum for a given value by putting, e.g,

 %a 3000

 >in the declarations section of the input.  In picking how much to raise
 >the value, I just increase by 500 or 1000 and see if that's enough.  Judging
 >from the operation of our Lex, I don't think any of these maximums have
 >a geometric effect on the amount of memory Lex uses.  They may very easily
 >have a geometric effect on Lex's run-time, though.
 >  Vern Paxson, Real Time Systems Group, Lawrence Berkeley Laboratory
 >  vern@lbl-csam.ARPA

 		From: Don Davis JHU/APL

 >I was confronted with this problem about one year ago.  I had to
 >increase the table parameters (I forget the name of the file, but
 >it was a .h file) and recompile lex.  I called my new lex "biglex" and
 >it worked just fine.
 >  How do you determine the correct size for the table parameters?  I didn't
 >have the time or inclination to analyze the operation of lex, so I just
 >used trial and error.  What can I say; it worked!
 >	Don Davis JHU/APL
 >	

To Index