1 package PPI::Tokenizer;
7 PPI::Tokenizer - The Perl Document Tokenizer
11 # Create a tokenizer for a file, array or string
12 $Tokenizer = PPI::Tokenizer->new( 'filename.pl' );
13 $Tokenizer = PPI::Tokenizer->new( \@lines );
14 $Tokenizer = PPI::Tokenizer->new( \$source );
16 # Return all the tokens for the document
17 my $tokens = $Tokenizer->all_tokens;
19 # Or we can use it as an iterator
20 while ( my $Token = $Tokenizer->get_token ) {
21 print "Found token '$Token'\n";
# If we REALLY need to manually nudge the cursor, you
# can do that too (the lexer needs this ability to do rollbacks)
26 $is_incremented = $Tokenizer->increment_cursor;
27 $is_decremented = $Tokenizer->decrement_cursor;
31 PPI::Tokenizer is the class that provides Tokenizer objects for use in
32 breaking strings of Perl source code into Tokens.
34 By the time you are reading this, you probably need to know a little
35 about the difference between how perl parses Perl "code" and how PPI
parses Perl "documents".
38 "perl" itself (the interpreter) uses a heavily modified lex specification
39 to specify its parsing logic, maintains several types of state as it
40 goes, and incrementally tokenizes, lexes AND EXECUTES at the same time.
42 In fact, it is provably impossible to use perl's parsing method without
simultaneously executing code. A formal mathematical proof of this
impossibility has been published.
46 This is where the truism "Only perl can parse Perl" comes from.
48 PPI uses a completely different approach by abandoning the (impossible)
49 ability to parse Perl the same way that the interpreter does, and instead
parsing the source as a document, using a document structure independently
51 derived from the Perl documentation and approximating the perl interpreter
52 interpretation as closely as possible.
54 It was touch and go for a long time whether we could get it close enough,
55 but in the end it turned out that it could be done.
57 In this approach, the tokenizer C<PPI::Tokenizer> is implemented separately
58 from the lexer L<PPI::Lexer>.
The job of C<PPI::Tokenizer> is to take pure source as a string and break it
up into a stream/set of tokens; it contains most of the "black magic" used
in PPI. By comparison, the lexer implements a relatively straightforward
63 tree structure, and has an implementation that is uncomplicated (compared
64 to the insanity in the tokenizer at least).
66 The Tokenizer uses an immense amount of heuristics, guessing and cruft,
67 supported by a very B<VERY> flexible internal API, but fortunately it was
68 possible to largely encapsulate the black magic, so there is not a lot that
69 gets exposed to people using the C<PPI::Tokenizer> itself.
73 Despite the incredible complexity, the Tokenizer itself only exposes a
74 relatively small number of methods, with most of the complexity implemented
79 # Make sure everything we need is loaded so
80 # we don't have to go and load all of PPI.
82 use Params::Util qw{_INSTANCE _SCALAR0 _ARRAY0};
83 use List::MoreUtils ();
87 use PPI::Exception ();
88 use PPI::Exception::ParserRejection ();
90 use vars qw{$VERSION};
99 #####################################################################
100 # Creation and Initialization
104 =head2 new $file | \@lines | \$source
106 The main C<new> constructor creates a new Tokenizer object. These
107 objects have no configuration parameters, and can only be used once,
108 to tokenize a single perl source file.
110 It takes as argument either a normal scalar containing source code,
111 a reference to a scalar containing source code, or a reference to an
112 ARRAY containing newline-terminated lines of source code.
114 Returns a new C<PPI::Tokenizer> object on success, or throws a
115 L<PPI::Exception> exception on error.
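For example, to build a Tokenizer from an in-memory string and trap a
failure (the C<eval>-based error handling here is just one way to catch
the thrown exception):

```perl
use strict;
use warnings;
use PPI::Tokenizer;

# new() takes a filename, a scalar ref to source, or an array ref of
# lines, and throws a PPI::Exception object on failure.
my $Tokenizer = eval { PPI::Tokenizer->new( \'my $answer = 42;' ) };
unless ( $Tokenizer ) {
    die "Failed to create Tokenizer: " . ( ref $@ ? $@->message : $@ );
}
```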
120 my $class = ref($_[0]) || $_[0];
122 # Create the empty tokenizer struct
126 source_bytes => undef,
130 line_length => undef,
131 line_cursor => undef,
136 class => 'PPI::Token::BOM',
137 zone => 'PPI::Token::Whitespace',
139 # Output token buffer
148 if ( ! defined $_[1] ) {
149 # We weren't given anything
150 PPI::Exception->throw("No source provided to Tokenizer");
152 } elsif ( ! ref $_[1] ) {
153 my $source = PPI::Util::_slurp($_[1]);
155 # Content returned by reference
156 $self->{source} = $$source;
158 # Errors returned as a string
162 } elsif ( _SCALAR0($_[1]) ) {
163 $self->{source} = ${$_[1]};
165 } elsif ( _ARRAY0($_[1]) ) {
# Lines are documented to be newline-terminated already
$self->{source} = join '', @{$_[1]};
169 # We don't support whatever this is
170 PPI::Exception->throw(ref($_[1]) . " is not supported as a source provider");
173 # We can't handle a null string
174 $self->{source_bytes} = length $self->{source};
175 if ( $self->{source_bytes} > 1048576 ) {
# Dammit! It's ALWAYS the "Perl" modules larger than a
# meg that seem to blow up the Tokenizer/Lexer.
# Nobody actually writes real programs larger than a meg.
# Perl::Tidy (the largest) is only 800k.
# It is always these idiots with massive Data::Dumper
# structs or huge RecDescent parsers.
182 PPI::Exception::ParserRejection->throw("File is too large");
184 } elsif ( $self->{source_bytes} ) {
185 # Split on local newlines
186 $self->{source} =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;
187 $self->{source} = [ split /(?<=\n)/, $self->{source} ];
190 $self->{source} = [ ];
194 # I'm explaining this earlier than I should so you can understand
195 # why I'm about to do something that looks very strange. There's
196 # a problem with the Tokenizer, in that tokens tend to change
197 # classes as each letter is added, but they don't get allocated
198 # their definite final class until the "end" of the token, the
199 # detection of which occurs in about a hundred different places,
200 # all through various crufty code (that triples the speed).
202 # However, in general, this does not apply to tokens in which a
203 # whitespace character is valid, such as comments, whitespace and
206 # So what we do is add a space to the end of the source. This
207 # triggers normal "end of token" functionality for all cases. Then,
208 # once the tokenizer hits end of file, it examines the last token to
209 # manually either remove the ' ' token, or chop it off the end of
210 # a longer one in which the space would be valid.
211 if ( List::MoreUtils::any { /^__(?:DATA|END)__\s*$/ } @{$self->{source}} ) {
212 $self->{source_eof_chop} = '';
213 } elsif ( ! defined $self->{source}->[0] ) {
214 $self->{source_eof_chop} = '';
215 } elsif ( $self->{source}->[-1] =~ /\s$/ ) {
216 $self->{source_eof_chop} = '';
218 $self->{source_eof_chop} = 1;
219 $self->{source}->[-1] .= ' ';
229 #####################################################################
230 # Main Public Methods
236 When using the PPI::Tokenizer object as an iterator, the C<get_token>
237 method is the primary method that is used. It increments the cursor
238 and returns the next Token in the output array.
240 The actual parsing of the file is done only as-needed, and a line at
241 a time. When C<get_token> hits the end of the token array, it will
242 cause the parser to pull in the next line and parse it, continuing
243 as needed until there are more tokens on the output array that
244 get_token can then return.
246 This means that a number of Tokenizer objects can be created, and
247 won't consume significant CPU until you actually begin to pull tokens
Returns a L<PPI::Token> object on success, C<0> if the Tokenizer has
reached the end of the file, or C<undef> on error.
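A typical iteration loop that distinguishes the three return cases might
look like this:

```perl
use strict;
use warnings;
use PPI::Tokenizer;

my $Tokenizer = PPI::Tokenizer->new( \'print "Hello";' );

# get_token returns a PPI::Token while tokens remain, 0 at a clean
# end of file, and undef on error.
my @found;
my $Token;
while ( $Token = $Tokenizer->get_token ) {
    push @found, "$Token";    # tokens stringify to their content
}
die "Tokenizer error" unless defined $Token;
print scalar(@found), " tokens found\n";
```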
259 if ( $self->{token_eof}
260 and $self->{token_cursor} > scalar @{$self->{tokens}}
265 # Return the next token if we can
266 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
267 $self->{token_cursor}++;
273 # Catch exceptions and return undef, so that we
274 # can start to convert code to exception-based code.
276 # No token, we need to get some more
277 while ( $line_rv = $self->_process_next_line ) {
278 # If there is something in the buffer, return it
279 # The defined() prevents a ton of calls to PPI::Util::TRUE
280 if ( defined( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) ) {
281 $self->{token_cursor}++;
288 if ( _INSTANCE($@, 'PPI::Exception') ) {
292 $errstr =~ s/^(.*) at line .+$/$1/;
293 PPI::Exception->throw( $errstr );
299 if ( defined $line_rv ) {
300 # End of file, but we can still return things from the buffer
301 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
302 $self->{token_cursor}++;
306 # Set our token end of file flag
307 $self->{token_eof} = 1;
311 # Error, pass it up to our caller
319 When not being used as an iterator, the C<all_tokens> method tells
320 the Tokenizer to parse the entire file and return all of the tokens
321 in a single ARRAY reference.
323 It should be noted that C<all_tokens> does B<NOT> interfere with the
324 use of the Tokenizer object as an iterator (does not modify the token
325 cursor) and use of the two different mechanisms can be mixed safely.
327 Returns a reference to an ARRAY of L<PPI::Token> objects on success
328 or throws an exception on error.
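In its simplest form, with the C<eval> shown here needed only if you want
to recover from a failed parse:

```perl
use strict;
use warnings;
use PPI::Tokenizer;

my $Tokenizer = PPI::Tokenizer->new( \'my @list = (1, 2, 3);' );

# Parse everything in one pass; all_tokens throws on failure rather
# than returning undef, so trap it with eval if you need to recover.
my $tokens = eval { $Tokenizer->all_tokens };
die "Parse failed: $@" unless $tokens;
print scalar(@$tokens), " tokens\n";
```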
335 # Catch exceptions and return undef, so that we
336 # can start to convert code to exception-based code.
338 # Process lines until we get EOF
339 unless ( $self->{token_eof} ) {
341 while ( $rv = $self->_process_next_line ) {}
342 unless ( defined $rv ) {
343 PPI::Exception->throw("Error while processing source");
346 # Clean up the end of the tokenizer
352 $errstr =~ s/^(.*) at line .+$/$1/;
353 PPI::Exception->throw( $errstr );
356 # End of file, return a copy of the token array.
357 return [ @{$self->{tokens}} ];
362 =head2 increment_cursor
Although exposed as a public method, C<increment_cursor> is implemented
365 for expert use only, when writing lexers or other components that work
366 directly on token streams.
368 It manually increments the token cursor forward through the file, in effect
369 "skipping" the next token.
Returns true if the cursor is incremented, C<0> if already at the end of
the file, or C<undef> on error.
376 sub increment_cursor {
377 # Do this via the get_token method, which makes sure there
378 # is actually a token there to move to.
379 $_[0]->get_token and 1;
384 =head2 decrement_cursor
Although exposed as a public method, C<decrement_cursor> is implemented
387 for expert use only, when writing lexers or other components that work
388 directly on token streams.
390 It manually decrements the token cursor backwards through the file, in
391 effect "rolling back" the token stream. And indeed that is what it is
392 primarily intended for, when the component that is consuming the token
393 stream needs to implement some sort of "roll back" feature in its use
Returns true if the cursor is decremented, C<0> if already at the
beginning of the file, or C<undef> on error.
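A minimal peek-and-rollback sketch, of the kind a lexer might use:

```perl
use strict;
use warnings;
use PPI::Tokenizer;

my $Tokenizer = PPI::Tokenizer->new( \'my $x = 1;' );

# Look at the next token, then step the cursor back so the stream
# appears untouched to the next consumer.
my $peek = $Tokenizer->get_token;
$Tokenizer->decrement_cursor if $peek;

my $again = $Tokenizer->get_token;
print "peeked '$peek', re-read '$again'\n";
```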
401 sub decrement_cursor {
404 # Check for the beginning of the file
405 return 0 unless $self->{token_cursor};
407 # Decrement the token cursor
408 $self->{token_eof} = 0;
409 --$self->{token_cursor};
416 #####################################################################
417 # Working With Source
419 # Fetches the next line from the input line buffer
420 # Returns undef at EOF.
423 return undef unless $self->{source}; # EOF hit previously
425 # Pull off the next line
426 my $line = shift @{$self->{source}};
428 # Flag EOF if we hit it
429 $self->{source} = undef unless defined $line;
431 # Return the line (or EOF flag)
432 return $line; # string or undef
435 # Fetches the next line, ready to process
436 # Returns 1 on success
443 my $line = $self->_get_line;
444 unless ( defined $line ) {
447 delete $self->{line};
448 delete $self->{line_cursor};
449 delete $self->{line_length};
453 # In the scan version, just set the cursor to the end
454 # of the line, and the rest should just cascade out.
455 $self->{line_cursor} = $self->{line_length};
459 # Populate the appropriate variables
460 $self->{line} = $line;
461 $self->{line_cursor} = -1;
462 $self->{line_length} = length $line;
463 $self->{line_count}++;
468 # Get the current character
471 substr( $self->{line}, $self->{line_cursor}, 1 );
478 ####################################################################
479 # Per line processing methods
481 # Processes the next line
# Returns 1 on successful completion
484 # Returns undef on error
485 sub _process_next_line {
488 # Fill the line buffer
490 unless ( $rv = $self->_fill_line ) {
491 return undef unless defined $rv;
493 # End of file, finalize last token
494 $self->_finalize_token;
498 # Run the __TOKENIZER__on_line_start
499 $rv = $self->{class}->__TOKENIZER__on_line_start( $self );
501 # If there are no more source lines, then clean up
502 if ( ref $self->{source} eq 'ARRAY' and ! @{$self->{source}} ) {
506 # Defined but false means next line
507 return 1 if defined $rv;
508 PPI::Exception->throw("Error at line $self->{line_count}");
511 # If we can't deal with the entire line, process char by char
512 while ( $rv = $self->_process_next_char ) {}
513 unless ( defined $rv ) {
514 PPI::Exception->throw("Error at line $self->{line_count}, character $self->{line_cursor}");
517 # Trigger any action that needs to happen at the end of a line
518 $self->{class}->__TOKENIZER__on_line_end( $self );
520 # If there are no more source lines, then clean up
521 unless ( ref($self->{source}) eq 'ARRAY' and @{$self->{source}} ) {
522 return $self->_clean_eof;
532 #####################################################################
533 # Per-character processing methods
535 # Process on a per-character basis.
# Note that due to the high number of times this gets
537 # called, it has been fairly heavily in-lined, so the code
538 # might look a bit ugly and duplicated.
539 sub _process_next_char {
542 ### FIXME - This checks for a screwed up condition that triggers
### several warnings, amongst other things.
544 if ( ! defined $self->{line_cursor} or ! defined $self->{line_length} ) {
549 # Increment the counter and check for end of line
550 return 0 if ++$self->{line_cursor} >= $self->{line_length};
552 # Pass control to the token class
554 unless ( $result = $self->{class}->__TOKENIZER__on_char( $self ) ) {
555 # undef is error. 0 is "Did stuff ourself, you don't have to do anything"
556 return defined $result ? 1 : undef;
559 # We will need the value of the current character
560 my $char = substr( $self->{line}, $self->{line_cursor}, 1 );
561 if ( $result eq '1' ) {
562 # If __TOKENIZER__on_char returns 1, it is signaling that it thinks that
563 # the character is part of it.
566 if ( defined $self->{token} ) {
567 $self->{token}->{content} .= $char;
569 defined($self->{token} = $self->{class}->new($char)) or return undef;
575 # We have been provided with the name of a class
576 if ( $self->{class} ne "PPI::Token::$result" ) {
578 $self->_new_token( $result, $char );
579 } elsif ( defined $self->{token} ) {
580 # Same class as current
581 $self->{token}->{content} .= $char;
583 # Same class, but no current
584 defined($self->{token} = $self->{class}->new($char)) or return undef;
594 #####################################################################
595 # Altering Tokens in Tokenizer
597 # Finish the end of a token.
598 # Returns the resulting parse class as a convenience.
599 sub _finalize_token {
601 return $self->{class} unless defined $self->{token};
603 # Add the token to the token buffer
604 push @{ $self->{tokens} }, $self->{token};
605 $self->{token} = undef;
607 # Return the parse class to that of the zone we are in
608 $self->{class} = $self->{zone};
611 # Creates a new token and sets it in the tokenizer
# The defined() in here prevents a ton of calls to PPI::Util::TRUE
615 # throw PPI::Exception() unless @_;
616 my $class = substr( $_[0], 0, 12 ) eq 'PPI::Token::'
617 ? shift : 'PPI::Token::' . shift;
619 # Finalize any existing token
620 $self->_finalize_token if defined $self->{token};
622 # Create the new token and update the parse class
623 defined($self->{token} = $class->new($_[0])) or PPI::Exception->throw;
624 $self->{class} = $class;
629 # At the end of the file, we need to clean up the results of the erroneous
630 # space that we inserted at the beginning of the process.
634 # Finish any partially completed token
635 $self->_finalize_token if $self->{token};
637 # Find the last token, and if it has no content, kill it.
638 # There appears to be some evidence that such "null tokens" are
639 # somehow getting created accidentally.
640 my $last_token = $self->{tokens}->[ -1 ];
641 unless ( length $last_token->{content} ) {
642 pop @{$self->{tokens}};
645 # Now, if the last character of the last token is a space we added,
646 # chop it off, deleting the token if there's nothing else left.
647 if ( $self->{source_eof_chop} ) {
648 $last_token = $self->{tokens}->[ -1 ];
649 $last_token->{content} =~ s/ $//;
650 unless ( length $last_token->{content} ) {
652 pop @{$self->{tokens}};
655 # The hack involving adding an extra space is now reversed, and
656 # now nobody will ever know. The perfect crime!
657 $self->{source_eof_chop} = '';
667 #####################################################################
672 $_[0]->{tokens}->[-1];
675 sub _last_significant_token {
677 my $cursor = $#{ $self->{tokens} };
678 while ( $cursor >= 0 ) {
679 my $token = $self->{tokens}->[$cursor--];
680 return $token if $token->significant;
684 PPI::Token::Whitespace->null;
687 # Get an array ref of previous significant tokens.
688 # Like _last_significant_token except it gets more than just one token
689 # Returns array ref on success.
690 # Returns 0 on not enough tokens
691 sub _previous_significant_tokens {
693 my $count = shift || 1;
694 my $cursor = $#{ $self->{tokens} };
696 my ($token, @tokens);
697 while ( $cursor >= 0 ) {
698 $token = $self->{tokens}->[$cursor--];
699 if ( $token->significant ) {
700 push @tokens, $token;
701 return \@tokens if scalar @tokens >= $count;
706 foreach ( 1 .. ($count - scalar @tokens) ) {
707 push @tokens, PPI::Token::Whitespace->null;
713 my %OBVIOUS_CLASS = (
714 'PPI::Token::Symbol' => 'operator',
715 'PPI::Token::Magic' => 'operator',
716 'PPI::Token::Number' => 'operator',
717 'PPI::Token::ArrayIndex' => 'operator',
718 'PPI::Token::Quote::Double' => 'operator',
719 'PPI::Token::Quote::Interpolate' => 'operator',
720 'PPI::Token::Quote::Literal' => 'operator',
721 'PPI::Token::Quote::Single' => 'operator',
722 'PPI::Token::QuoteLike::Backtick' => 'operator',
723 'PPI::Token::QuoteLike::Command' => 'operator',
724 'PPI::Token::QuoteLike::Readline' => 'operator',
725 'PPI::Token::QuoteLike::Regexp' => 'operator',
726 'PPI::Token::QuoteLike::Words' => 'operator',
729 my %OBVIOUS_CONTENT = (
# Try to determine operator/operand context, if possible.
738 # Returns "operator", "operand", or "" if unknown.
741 my $tokens = $self->_previous_significant_tokens(1);
742 my $p0 = $tokens->[0];
745 # Map the obvious cases
746 return $OBVIOUS_CLASS{$c0} if defined $OBVIOUS_CLASS{$c0};
747 return $OBVIOUS_CONTENT{$p0} if defined $OBVIOUS_CONTENT{$p0};
749 # Most of the time after an operator, we are an operand
750 return 'operand' if $p0->isa('PPI::Token::Operator');
752 # If there's NOTHING, it's operand
753 return 'operand' if $p0->content eq '';
755 # Otherwise, we don't know
765 =head2 How the Tokenizer Works
Understanding the Tokenizer is not for the faint-hearted. It is by far
768 the most complex and twisty piece of perl I've ever written that is actually
769 still built properly and isn't a terrible spaghetti-like mess. In fact, you
770 probably want to skip this section.
772 But if you really want to understand, well then here goes.
774 =head2 Source Input and Clean Up
The Tokenizer starts by taking source in a variety of forms, sucking it
all in and merging it into one big string, and doing its own internal line
split, using a "universal line separator" which allows the Tokenizer to
779 take source for any platform (and even supports a few known types of
780 broken newlines caused by mixed mac/pc/*nix editor screw ups).
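The normalisation amounts to two steps, sketched here in isolation (this
mirrors the logic in C<new>, not an exported API):

```perl
use strict;
use warnings;

# Source with mixed DOS (\r\n), classic Mac (\r) and Unix (\n) endings
my $source = "one\015\012two\015three\012four";

# Step 1: collapse every known newline form to a single "\n"
$source =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;

# Step 2: split into lines, keeping the trailing "\n" on each line
my @lines = split /(?<=\n)/, $source;

print scalar(@lines), " lines\n";
```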
782 The resulting array of lines is used to feed the tokenizer, and is also
783 accessed directly by the heredoc-logic to do the line-oriented part of
786 =head2 Doing Things the Old Fashioned Way
788 Due to the complexity of perl, and after 2 previously aborted parser
789 attempts, in the end the tokenizer was fashioned around a line-buffered
790 character-by-character method.
792 That is, the Tokenizer pulls and holds a line at a time into a line buffer,
793 and then iterates a cursor along it. At each cursor position, a method is
794 called in whatever token class we are currently in, which will examine the
795 character at the current position, and handle it.
797 As the handler methods in the various token classes are called, they
build up an output token array for the source code.
800 Various parts of the Tokenizer use look-ahead, arbitrary-distance
801 look-behind (although currently the maximum is three significant tokens),
802 or both, and various other heuristic guesses.
804 I've been told it is officially termed a I<"backtracking parser
805 with infinite lookaheads">.
807 =head2 State Variables
809 Aside from the current line and the character cursor, the Tokenizer
810 maintains a number of different state variables.
816 The Tokenizer maintains the current token class at all times. Much of the
time it is just going to be the "Whitespace" class, which is what the base of
818 a document is. As the tokenizer executes the various character handlers,
the class changes a lot as it moves along. In fact, in some instances,
820 the character handler may not handle the character directly itself, but
821 rather change the "current class" and then hand off to the character
822 handler for the new class.
824 Because of this, and some other things I'll deal with later, the number of
825 times the character handlers are called does not in fact have a direct
826 relationship to the number of actual characters in the document.
830 Rather than create a class stack to allow for infinitely nested layers of
831 classes, the Tokenizer recognises just a single layer.
833 To put it a different way, in various parts of the file, the Tokenizer will
834 recognise different "base" or "substrate" classes. When a Token such as a
835 comment or a number is finalised by the tokenizer, it "falls back" to the
838 This allows proper tokenization of special areas such as __DATA__
839 and __END__ blocks, which also contain things like comments and POD,
840 without allowing the creation of any significant Tokens inside these areas.
842 For the main part of a document we use L<PPI::Token::Whitespace> for this,
843 with the idea being that code is "floating in a sea of whitespace".
847 The final main state variable is the "current token". This is the Token
848 that is currently being built by the Tokenizer. For certain types, it
849 can be manipulated and morphed and change class quite a bit while being
850 assembled, as the Tokenizer's understanding of the token content changes.
852 When the Tokenizer is confident that it has seen the end of the Token, it
853 will be "finalized", which adds it to the output token array and resets
854 the current class to that of the zone that we are currently in.
856 I should also note at this point that the "current token" variable is
857 optional. The Tokenizer is capable of knowing what class it is currently
858 set to, without actually having accumulated any characters in the Token.
862 =head2 Making It Faster
864 As I'm sure you can imagine, calling several different methods for each
865 character and running regexes and other complex heuristics made the first
866 fully working version of the tokenizer extremely slow.
868 During testing, I created a metric to measure parsing speed called
LPGC, or "lines per gigacycle". A gigacycle is simply a billion CPU
870 cycles on a typical single-core CPU, and so a Tokenizer running at
"1000 lines per gigacycle" should tokenize around 1200 lines of
code per second when running on a 1200 MHz processor.
874 The first working version of the tokenizer ran at only 350 LPGC, so to
875 tokenize a typical large module such as L<ExtUtils::MakeMaker> took
10-15 seconds. This sluggishness made it impractical for many uses.
878 So in the current parser, there are multiple layers of optimisation
very carefully built into the basic design. This has brought the tokenizer
880 up to a more reasonable 1000 LPGC, at the expense of making the code
881 quite a bit twistier.
883 =head2 Making It Faster - Whole Line Classification
The first step in the optimisation process was to add a new handler to
enable several of the more basic classes (whitespace, comments) to be
parsed a line at a time. At the start of each line, a
888 special optional handler (only supported by a few classes) is called to
889 check and see if the entire line can be parsed in one go.
891 This is used mainly to handle things like POD, comments, empty lines,
892 and a few other minor special cases.
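The idea can be illustrated with a stand-alone classifier (this function is
illustrative only; the real mechanism is a per-class
C<__TOKENIZER__on_line_start> hook, not this helper):

```perl
use strict;
use warnings;

# Decide whether an entire line can be classified in one go, or
# whether it needs the full per-character machinery (undef).
sub classify_whole_line {
    my ($line) = @_;
    return 'Comment'    if $line =~ /^\s*#/;
    return 'Whitespace' if $line =~ /^\s*$/;
    return 'Pod'        if $line =~ /^=\w+/;
    return undef;
}

print classify_whole_line("# a comment\n"), "\n";
```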
894 =head2 Making It Faster - Inlining
896 The second stage of the optimisation involved inlining a small
897 number of critical methods that were repeated an extremely high number
898 of times. Profiling suggested that there were about 1,000,000 individual
899 method calls per gigacycle, and by cutting these by two thirds a significant
speed improvement was gained, on the order of about 50%.
902 You may notice that many methods in the C<PPI::Tokenizer> code look
very nested and longhand. This is primarily due to this inlining.
905 At around this time, some statistics code that existed in the early
906 versions of the parser was also removed, as it was determined that
907 it was consuming around 15% of the CPU for the entire parser, while
908 making the core more complicated.
910 A judgment call was made that with the difficulties likely to be
911 encountered with future planned enhancements, and given the relatively
912 high cost involved, the statistics features would be removed from the
915 =head2 Making It Faster - Quote Engine
917 Once inlining had reached diminishing returns, it became obvious from
918 the profiling results that a huge amount of time was being spent
stepping a char at a time through long, simple and "syntactically boring"
920 code such as comments and strings.
922 The existing regex engine was expanded to also encompass quotes and
923 other quote-like things, and a special abstract base class was added
924 that provided a number of specialised parsing methods that would "scan
ahead", looking ahead to find the end of a string, and updating
926 the cursor to leave it in a valid position for the next call.
928 This is also the point at which the number of character handler calls began
929 to greatly differ from the number of characters. But it has been done
930 in a way that allows the parser to retain the power of the original
931 version at the critical points, while skipping through the "boring bits"
932 as needed for additional speed.
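The heart of such a scan-ahead can be sketched as a plain function: given a
buffer and a cursor just past an opening quote, one regex match finds the
close instead of one handler call per character (illustrative only; the
real Quote Engine also copes with multi-line strings and per-quote-type
escaping rules):

```perl
use strict;
use warnings;

# Scan ahead from $pos for the end of a single-quoted string,
# honouring backslash escapes. Returns the position just past the
# closing quote, or undef if the string does not end in this buffer.
sub scan_single_quote {
    my ( $buffer, $pos ) = @_;
    pos($buffer) = $pos;
    return undef unless $buffer =~ /\G(?:\\.|[^'\\])*'/gs;
    return pos($buffer);
}

my $line = q{my $s = 'it\'s here'; next();};
my $end  = scan_single_quote( $line, 9 );   # cursor just after the opening quote
print "string body ends at offset $end\n";
```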
934 The addition of this feature allowed the tokenizer to exceed 1000 LPGC
937 =head2 Making It Faster - The "Complete" Mechanism
939 As it became evident that great speed increases were available by using
940 this "skipping ahead" mechanism, a new handler method was added that
941 explicitly handles the parsing of an entire token, where the structure
942 of the token is relatively simple. Tokens such as symbols fit this case,
as once we are past the initial sigil and word char, we know that we
944 can skip ahead and "complete" the rest of the token much more easily.
946 A number of these have been added for most or possibly all of the common
947 cases, with most of these "complete" handlers implemented using regular
950 In fact, so many have been added that at this point, you could arguably
reclassify the tokenizer as a "hybrid regex, char-by-char heuristic
952 tokenizer". More tokens are now consumed in "complete" methods in a
953 typical program than are handled by the normal char-by-char methods.
Many of these complete-handlers were implemented during the writing
956 of the Lexer, and this has allowed the full parser to maintain around
957 1000 LPGC despite the increasing weight of the Lexer.
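As a concrete illustration, a symbol such as C<$foo> can be "completed"
with a single anchored match once the sigil has been seen (a simplified
sketch; the real handler covers more sigil and package-name forms):

```perl
use strict;
use warnings;

# Once the leading sigil has been seen, the rest of a symbol token is
# structurally simple, so it can be consumed in one regex match.
sub complete_symbol {
    my ( $buffer, $pos ) = @_;   # $pos points at the sigil
    pos($buffer) = $pos;
    return undef unless $buffer =~ /\G([\$\@\%][\w:]+)/g;
    return $1;
}

print complete_symbol( 'my $foo_bar = 1;', 3 ), "\n";  # prints "$foo_bar"
```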
959 =head2 Making It Faster - Porting To C (In Progress)
961 While it would be extraordinarily difficult to port all of the Tokenizer
962 to C, work has started on a L<PPI::XS> "accelerator" package which acts as
963 a separate and automatically-detected add-on to the main PPI package.
965 L<PPI::XS> implements faster versions of a variety of functions scattered
966 over the entire PPI codebase, from the Tokenizer Core, Quote Engine, and
967 various other places, and implements them identically in XS/C.
969 In particular, the skip-ahead methods from the Quote Engine would appear
970 to be extremely amenable to being done in C, and a number of other
971 functions could be cherry-picked one at a time and implemented in C.
973 Each method is heavily tested to ensure that the functionality is
974 identical, and a versioning mechanism is included to ensure that if a
975 function gets out of sync, L<PPI::XS> will degrade gracefully and just
976 not replace that single method.
980 - Add an option to reset or seek the token stream...
982 - Implement more Tokenizer functions in L<PPI::XS>
986 See the L<support section|PPI/SUPPORT> in the main module.
990 Adam Kennedy E<lt>adamk@cpan.orgE<gt>
994 Copyright 2001 - 2009 Adam Kennedy.
996 This program is free software; you can redistribute
997 it and/or modify it under the same terms as Perl itself.
999 The full text of the license can be found in the
1000 LICENSE file included with this module.