Html Tokenizer

This package tokenizes HTML input into a nested collection of tokens.

Some uses of HTML tokens:

Tidy/Minify HTML output
Preprocess HTML
Filter HTML
Sanitize HTML

Install

Via Composer

$ composer require kevintweber/html-tokenizer

Usage

<?php

require 'vendor/autoload.php'; // Load Composer's autoloader.

use Kevintweber\HtmlTokenizer\HtmlTokenizer;

$htmlDocument = file_get_contents("path/to/html/document.html");

$htmlTokenizer = new HtmlTokenizer();
$tokens = $htmlTokenizer->parse($htmlDocument);  // That was easy ...

// Once you have tokens, you can manipulate them.
foreach ($tokens as $token) {
    if ($token->isElement()) {
        echo $token->getName() . "\n";
    }
}

// Or just output them to an array.
$tokenArray = $tokens->toArray();

The following simple HTML:

<!DOCTYPE html>
<html>
    <head>
        <title>Test</title>
    </head>
    <body>
        <!-- Start of content. -->
        <h1 id="big_title">Whoa!</h1>
        <div class="centered">It <em>parses</em>!</div>
    </body>
</html>

will produce the following array:

array(
    array(
        'type' => 'doctype',
        'value' => 'html',
        'line' => 0,
        'position' => 0
    ),
    array(
        'type' => 'element',
        'name' => 'html',
        'line' => 1,
        'position' => 0,
        'children' => array(
            array(
                'type' => 'element',
                'name' => 'head',
                'line' => 2,
                'position' => 4,
                'children' => array(
                    array(
                        'type' => 'element',
                        'name' => 'title',
                        'line' => 3,
                        'position' => 8,
                        'children' => array(
                            array(
                                'type' => 'text',
                                'value' => 'Test',
                                'line' => 3,
                                'position' => 15
                            )
                        )
                    )
                )
            ),
            array(
                'type' => 'element',
                'name' => 'body',
                'line' => 5,
                'position' => 4,
                'children' => array(
                    array(
                        'type' => 'comment',
                        'value' => 'Start of content.',
                        'line' => 6,
                        'position' => 8
                    ),
                    array(
                        'type' => 'element',
                        'name' => 'h1',
                        'line' => 7,
                        'position' => 8,
                        'attributes' => array(
                            'id' => 'big_title'
                        ),
                        'children' => array(
                            array(
                                'type' => 'text',
                                'value' => 'Whoa!',
                                'line' => 7,
                                'position' => 27
                            )
                        )
                    ),
                    array(
                        'type' => 'element',
                        'name' => 'div',
                        'line' => 8,
                        'position' => 8,
                        'attributes' => array(
                            'class' => 'centered'
                        ),
                        'children' => array(
                            array(
                                'type' => 'text',
                                'value' => 'It ',
                                'line' => 8,
                                'position' => 30
                            ),
                            array(
                                'type' => 'element',
                                'name' => 'em',
                                'line' => 8,
                                'position' => 33,
                                'children' => array(
                                    array(
                                        'type' => 'text',
                                        'value' => 'parses',
                                        'line' => 8,
                                        'position' => 37
                                    )
                                )
                            ),
                            array(
                                'type' => 'text',
                                'value' => '!',
                                'line' => 8,
                                'position' => 48
                            )
                        )
                    )
                )
            )
        )
    )
)
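Because child tokens are nested under a `children` key, most processing of this array is recursive. The following is a minimal sketch, not part of the package: `outlineElements` is a hypothetical helper that walks a trimmed copy of the array above and lists each element with its line number. It assumes an element without content may lack the `children` key.

```php
<?php

/**
 * Recursively collect one indented outline entry per element token.
 * Hypothetical helper for illustration; not provided by the package.
 */
function outlineElements(array $tokens, int $depth = 0): array
{
    $lines = [];
    foreach ($tokens as $token) {
        // Skip doctype, text, comment and other non-element tokens.
        if (($token['type'] ?? null) !== 'element') {
            continue;
        }
        $lines[] = str_repeat('  ', $depth) . $token['name'] . ' (line ' . $token['line'] . ')';
        // Assumption: a missing 'children' key means the element has no children.
        $children = $token['children'] ?? [];
        $lines = array_merge($lines, outlineElements($children, $depth + 1));
    }
    return $lines;
}

// Trimmed copy of the array shown above.
$tokenArray = [
    ['type' => 'doctype', 'value' => 'html', 'line' => 0, 'position' => 0],
    ['type' => 'element', 'name' => 'html', 'line' => 1, 'position' => 0, 'children' => [
        ['type' => 'element', 'name' => 'head', 'line' => 2, 'position' => 4, 'children' => [
            ['type' => 'element', 'name' => 'title', 'line' => 3, 'position' => 8, 'children' => [
                ['type' => 'text', 'value' => 'Test', 'line' => 3, 'position' => 15],
            ]],
        ]],
    ]],
];

$outline = outlineElements($tokenArray);
echo implode("\n", $outline) . "\n";
```

The same recursive shape works for other traversals, such as collecting all text values or all attributes.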

Tokens

The tokens are of the following types:

Name	Example
`cdata`	<![CDATA[ Character data goes in here. ]]>
`comment`	<!-- Comments go in here. -->
`doctype`	<!DOCTYPE html>
`element`	<img alt="Most of your markup will be elements."/>
`php`	<?php echo "PHP code goes in here."; ?>
`text`	Most of your content will be text.
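The `type` key of each token makes filtering straightforward. As a hedged sketch of the "Filter HTML" use case, the hypothetical helper `dropTokenTypes` below removes every token of the listed types from an array shaped like the `toArray()` output. The sample data is hand-written and omits the `line` and `position` keys for brevity.

```php
<?php

/**
 * Return a copy of the token array without tokens of the given types.
 * Hypothetical helper for illustration; not provided by the package.
 */
function dropTokenTypes(array $tokens, array $types): array
{
    $kept = [];
    foreach ($tokens as $token) {
        if (in_array($token['type'], $types, true)) {
            continue; // Drop this token and everything nested under it.
        }
        if (isset($token['children'])) {
            // $token is a copy, so the input array is left untouched.
            $token['children'] = dropTokenTypes($token['children'], $types);
        }
        $kept[] = $token;
    }
    return $kept;
}

// Hand-written sample in the same shape as the toArray() output.
$tokens = [
    ['type' => 'comment', 'value' => 'Top-level comment.'],
    ['type' => 'element', 'name' => 'body', 'children' => [
        ['type' => 'comment', 'value' => 'Start of content.'],
        ['type' => 'text', 'value' => 'Whoa!'],
    ]],
];

$clean = dropTokenTypes($tokens, ['comment', 'php']);
```

After the call, `$clean` holds only the `body` element with its single text child.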

Special parsing situations

Contents of an "iframe" element are not parsed.
Contents of a "script" element are considered TEXT.
Contents of a "style" element are considered TEXT.

Limitations

Currently, this package will tokenize HTML5 and XHTML.

It tries to handle errors according to the standard. The tokenizer can handle some (but not all) malformed HTML. You can set the tokenizer to fail silently or throw an exception when it encounters an error. (The default setting is to throw an exception.)
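Since the default is to throw, calling code that processes untrusted markup may want to catch the failure itself. The sketch below shows one way to do that under stated assumptions: this README does not name the exception class or the setting that switches to silent failure, so the code catches the generic `\Exception`, and `parseOrNull` and `$strictParser` are hypothetical stand-ins. In real use, the callable would wrap a call such as `$htmlTokenizer->parse($html)`.

```php
<?php

/**
 * Run a parser callable and turn a thrown exception into a null result.
 * Hypothetical helper for illustration; not provided by the package.
 *
 * @param callable    $parse Stand-in for a call to the tokenizer's parse method.
 * @param string      $html  Markup to parse.
 * @param string|null $error Receives the exception message on failure.
 */
function parseOrNull(callable $parse, string $html, ?string &$error = null)
{
    try {
        return $parse($html);
    } catch (\Exception $e) {
        // The concrete exception class is an unknown here, so catch broadly.
        $error = $e->getMessage();
        return null;
    }
}

// Hypothetical strict parser, used only to make the sketch runnable.
$strictParser = function (string $html) {
    if (substr_count($html, '<') !== substr_count($html, '>')) {
        throw new \RuntimeException('Unbalanced angle brackets');
    }
    return ['parsed' => $html];
};

$ok = parseOrNull($strictParser, '<p>fine</p>', $okError);
$bad = parseOrNull($strictParser, '<p broken', $badError);
```

Returning `null` keeps the failure explicit at the call site while still letting a batch job continue with the next document.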

If you come across valid HTML this package cannot parse, please submit an issue.

Change log

Please see CHANGELOG for more information on what has changed recently.

Testing

$ phpunit

Contributing

Please see CONTRIBUTING for details.

Security

If you discover any security-related issues, please email [email protected] instead of using the issue tracker.

Credits

License

The MIT License (MIT). Please see License File for more information.
