About jkphl/rdfa-lite-microdata

RDFa Lite 1.1 and HTML Microdata parser for web documents (HTML, SVG, XML)

rdfa-lite-microdata extracts RDFa Lite 1.1 and HTML Microdata information from web documents (HTML / SVG / XML). The embedded structures may use arbitrary vocabularies (e.g. schema.org) and are returned as a Plain Old PHP Object (POPO) that is compatible with the JSON serialization described for HTML Microdata.

RDFa Lite 1.1

To extract RDFa Lite 1.1 data out of a web document, instantiate an RdfaLite parser and call the appropriate parse method:

$rdfaParser = new \Jkphl\RdfaLiteMicrodata\Ports\Parser\RdfaLite();

// Parse an HTML file
$rdfaItems = $rdfaParser->parseHtmlFile('/path/to/file.html');

// Parse an HTML string
$rdfaItems = $rdfaParser->parseHtml('<html><head>...</head><body vocab="http://schema.org/">...</body>');

// Parse a DOM document (here: created from an HTML string)
$rdfaDom = new \DOMDocument();
$rdfaDom->loadHTML('<html><head>...</head><body vocab="http://schema.org/">...</body>');
$rdfaItems = $rdfaParser->parseDom($rdfaDom);

// Parse an XML file (e.g. SVG)
$rdfaItems = $rdfaParser->parseXmlFile('/path/to/file.svg');

// Parse an XML string (e.g. SVG)
$rdfaItems = $rdfaParser->parseXml('<svg viewBox="0 0 100 100" vocab="http://schema.org/">...</svg>');

echo json_encode($rdfaItems, JSON_PRETTY_PRINT);

The resulting JSON serialization will look something like this:

{
    "items": [
        {
            "type": [
                "http://schema.org/Movie"
            ],
            "id": "http://www.imdb.com/title/tt0499549/",
            "properties": {
                "http://schema.org/name": [
                    "Avatar"
                ],
                "http://schema.org/director": [
                    {
                        "type": [
                            "http://schema.org/Person"
                        ],
                        "id": null,
                        "properties": {
                            "http://schema.org/name": [
                                "James Cameron"
                            ],
                            "http://schema.org/birthDate": [
                                "August 16, 1954"
                            ]
                        }
                    }
                ],
                "http://schema.org/genre": [
                    "Science fiction"
                ],
                "http://schema.org/trailer": [
                    "../movies/avatar-theatrical-trailer.html"
                ]
            }
        }
    ]
}
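Because the result follows a regular shape (items carry type, id and properties, and every property maps to a list of values or nested items), it can be walked with plain PHP. The following is a minimal sketch, not part of the library: firstValue() is a hypothetical helper, and the sample data is a shortened copy of the output above decoded into arrays.

```php
<?php
// Shortened copy of the default (non-IRI) output shown above.
$json = <<<'JSON'
{
    "items": [
        {
            "type": ["http://schema.org/Movie"],
            "id": "http://www.imdb.com/title/tt0499549/",
            "properties": {
                "http://schema.org/name": ["Avatar"],
                "http://schema.org/director": [
                    {
                        "type": ["http://schema.org/Person"],
                        "id": null,
                        "properties": {
                            "http://schema.org/name": ["James Cameron"]
                        }
                    }
                ]
            }
        }
    ]
}
JSON;

/**
 * Hypothetical helper (not provided by the library): return the first
 * value of a property on an item, or null if the property is absent.
 */
function firstValue(array $item, $property)
{
    return isset($item['properties'][$property][0])
        ? $item['properties'][$property][0]
        : null;
}

// Decode into associative arrays for easy traversal
$result = json_decode($json, true);
$movie = $result['items'][0];

// A property value is either a plain string or a nested item
$director = firstValue($movie, 'http://schema.org/director');

echo firstValue($movie, 'http://schema.org/name'), PHP_EOL;    // Avatar
echo firstValue($director, 'http://schema.org/name'), PHP_EOL; // James Cameron
```

With real parser output, the same array form could presumably be obtained by round-tripping through JSON, e.g. json_decode(json_encode($rdfaItems), true).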

Item types and property names can be treated as references consisting of a profile IRI and a separate name. To enable IRI mode, instantiate the parser with true as argument:

$rdfaParser = new \Jkphl\RdfaLiteMicrodata\Ports\Parser\RdfaLite(true);
$rdfaItems = $rdfaParser->parseHtmlFile('/path/to/file.html');

With IRI mode enabled, the result will be more verbose (JSON serialization):

{
    "items": [
        {
            "type": [
                {
                    "profile": "http://schema.org/",
                    "name": "Movie"
                }
            ],
            "id": "http://www.imdb.com/title/tt0499549/",
            "properties": {
                "http://schema.org/name": {
                    "profile": "http://schema.org/",
                    "name": "name",
                    "values": [
                        "Avatar"
                    ]
                },
                "http://schema.org/director": {
                    "profile": "http://schema.org/",
                    "name": "director",
                    "values": [
                        {
                            "type": [
                                {
                                    "profile": "http://schema.org/",
                                    "name": "Person"
                                }
                            ],
                            "id": null,
                            "properties": {
                                "http://schema.org/name": {
                                    "profile": "http://schema.org/",
                                    "name": "name",
                                    "values": [
                                        "James Cameron"
                                    ]
                                },
                                "http://schema.org/birthDate": {
                                    "profile": "http://schema.org/",
                                    "name": "birthDate",
                                    "values": [
                                        "August 16, 1954"
                                    ]
                                }
                            }
                        }
                    ]
                },
                "http://schema.org/genre": {
                    "profile": "http://schema.org/",
                    "name": "genre",
                    "values": [
                        "Science fiction"
                    ]
                },
                "http://schema.org/trailer": {
                    "profile": "http://schema.org/",
                    "name": "trailer",
                    "values": [
                        "../movies/avatar-theatrical-trailer.html"
                    ]
                }
            }
        }
    ]
}
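To illustrate how the two output shapes relate, the sketch below collapses an IRI-mode item back into the compact form by concatenating each profile with its name. It is an illustration only: compactItem() is a hypothetical helper rather than a library function, and it assumes that the full IRI is always profile followed by name, as in the sample output above.

```php
<?php
/**
 * Hypothetical helper (not provided by the library): convert an IRI-mode
 * item (decoded as an associative array) into the compact form.
 * Assumption: full IRI = profile . name, as in the sample output.
 */
function compactItem(array $item)
{
    $types = [];
    foreach ($item['type'] as $type) {
        $types[] = $type['profile'] . $type['name'];
    }

    $properties = [];
    foreach ($item['properties'] as $property) {
        $values = [];
        foreach ($property['values'] as $value) {
            // Nested items are arrays; plain values are strings
            $values[] = is_array($value) ? compactItem($value) : $value;
        }
        $properties[$property['profile'] . $property['name']] = $values;
    }

    return [
        'type' => $types,
        'id' => isset($item['id']) ? $item['id'] : null,
        'properties' => $properties,
    ];
}

// Shortened IRI-mode sample, modelled on the output above
$person = [
    'type' => [['profile' => 'http://schema.org/', 'name' => 'Person']],
    'id' => null,
    'properties' => [
        'http://schema.org/name' => [
            'profile' => 'http://schema.org/',
            'name' => 'name',
            'values' => ['James Cameron'],
        ],
    ],
];
$movie = [
    'type' => [['profile' => 'http://schema.org/', 'name' => 'Movie']],
    'id' => 'http://www.imdb.com/title/tt0499549/',
    'properties' => [
        'http://schema.org/director' => [
            'profile' => 'http://schema.org/',
            'name' => 'director',
            'values' => [$person],
        ],
    ],
];

$compact = compactItem($movie);
echo json_encode($compact, JSON_PRETTY_PRINT), PHP_EOL;
```

The separate profile and name fields are mainly useful when a document mixes several vocabularies and properties need to be grouped or filtered by profile.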

HTML Microdata

The Microdata format isn't specified for non-HTML host formats, so the Microdata parser only supports HTML processing:

$microdataParser = new \Jkphl\RdfaLiteMicrodata\Ports\Parser\Microdata();

// Parse an HTML file
$microdataItems = $microdataParser->parseHtmlFile('/path/to/file.html');

// Parse an HTML string
$microdataItems = $microdataParser->parseHtml('<html><head>...</head><body itemscope itemtype="http://schema.org/Movie">...</body>');

// Parse a DOM document created from an HTML string
$microdataDom = new \DOMDocument();
$microdataDom->loadHTML('<html><head>...</head><body itemscope itemtype="http://schema.org/Movie">...</body>');
$microdataItems = $microdataParser->parseDom($microdataDom);

// Parse an HTML file with types / property names treated as IRIs
$microdataParserIri = new \Jkphl\RdfaLiteMicrodata\Ports\Parser\Microdata(true);
$microdataItems = $microdataParserIri->parseHtmlFile('/path/to/file.html');

Installation

This library requires PHP 5.5 or later. I recommend using the latest available version of PHP as a matter of principle. It has no userland dependencies. It can be installed and autoloaded via Composer as jkphl/rdfa-lite-microdata.

composer require jkphl/rdfa-lite-microdata

Alternatively, download a release or clone the repository, then require or include its autoload.php file.

Dependencies

Composer dependency graph

License

Copyright © 2017 Joschi Kuphal / joschi@tollwerk.de. Licensed under the terms of the MIT license.