Symfony Web Scraper



Goutte is a PHP scraping library built on Symfony components. Symfony is a set of reusable PHP components and libraries, a web application framework, a philosophy, and a community, all working together. It was created by SensioLabs, published as free software in 2005, and released under the MIT license. Google scraper is a simple PHP scraper that makes use of the Symfony web framework; it supports extracting data from the Google search results page.

We can use Spiral Queue and the RoadRunner server to implement applications that differ from the classic web setup. In this tutorial, we will implement a simple web-scraper application for CLI usage.

The scraped data will be stored in a runtime folder.

The resulting code only demonstrates the capabilities and leaves plenty of room for improvement.

Installing Dependencies

We will base our application on spiral/app-cli, the minimalistic Spiral build without ORM, HTTP, and other extensions.

To implement all needed features we will need a set of extensions:

Extension                   Comment
spiral/jobs                 Queue support
spiral/scaffolder           Faster scaffolding (dev only)
spiral/prototype            Faster prototyping (dev only)
paquettg/php-html-parser    Parsing HTML

To install all needed packages and download app server:

Activate the installed extensions in your App\App:

Make sure to run php app.php configure to ensure proper installation.

Configure App Server

Let's configure the application server with one default in-memory queue. Create a .rr.yaml file in the root of the project:

Create Job Handler

Now, let's write a simple job handler that scans the website, gets the HTML content, and follows links until the specified depth is reached. All the content will be stored in the runtime directory.
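
The link-following step can be sketched in plain PHP. This is a minimal illustration, not the tutorial's handler: it uses PHP's bundled DOM extension instead of paquettg/php-html-parser, the function name extractLinks is hypothetical, and it only understands absolute http(s) links and root-relative links such as /about.

```php
<?php

declare(strict_types=1);

/**
 * Return the unique absolute http(s) URLs found in <a href> attributes.
 * Hypothetical helper built on the DOM extension (not php-html-parser).
 *
 * @return string[]
 */
function extractLinks(string $html, string $baseUrl): array
{
    if (trim($html) === '') {
        return [];
    }

    $base = parse_url($baseUrl);
    if ($base === false || !isset($base['scheme'], $base['host'])) {
        return [];
    }
    $origin = $base['scheme'] . '://' . $base['host']
        . (isset($base['port']) ? ':' . $base['port'] : '');

    $dom = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue; // empty link or in-page anchor
        }

        if (preg_match('#^https?://#i', $href) === 1) {
            $url = $href;
        } elseif ($href[0] === '/' && substr($href, 0, 2) !== '//') {
            $url = $origin . $href; // root-relative link
        } else {
            continue; // mailto:, javascript:, and other forms are skipped here
        }

        $links[$url] = true; // array keys de-duplicate the result
    }

    return array_keys($links);
}
```

Path-relative links (for example page2.html) and protocol-relative links are deliberately ignored to keep the sketch short; a fuller version would resolve them against the base URL.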

Create the JobHandler via php app.php create:job scrape. For simplicity, we are not going to use cURL.
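
As a rough, framework-free sketch of the scraping logic itself (not the generated JobHandler), the function below stores each page in a directory and follows links until the depth is exhausted. Everything here is an assumption made for illustration: the function name scrape, the md5-based file names, and the two injected callables ($fetch returns the HTML of a URL or null, $extractLinks returns the absolute URLs found in a page). It recurses in-process, whereas a queue-based handler would presumably push a new job per link.

```php
<?php

declare(strict_types=1);

/**
 * Depth-limited crawl (illustrative sketch).
 *
 * @param callable $fetch        fn(string $url): ?string, HTML or null on failure
 * @param callable $extractLinks fn(string $html, string $url): string[]
 * @return string[] paths of the files written to $directory
 */
function scrape(
    string $url,
    int $depth,
    string $directory,
    callable $fetch,
    callable $extractLinks
): array {
    if ($depth < 0) {
        return [];
    }

    $html = $fetch($url);
    if ($html === null) {
        return []; // unreachable page: nothing to store, nothing to follow
    }

    if (!is_dir($directory) && !mkdir($directory, 0777, true) && !is_dir($directory)) {
        throw new RuntimeException("Unable to create directory {$directory}");
    }

    // Hashing the URL gives a file name that is always safe on disk.
    $file = rtrim($directory, '/') . '/' . md5($url) . '.html';
    file_put_contents($file, $html);
    $written = [$file];

    if ($depth === 0) {
        return $written; // depth budget spent: store the page, do not follow links
    }

    foreach ($extractLinks($html, $url) as $link) {
        $written = array_merge(
            $written,
            scrape($link, $depth - 1, $directory, $fetch, $extractLinks)
        );
    }

    return $written;
}
```

Without cURL, $fetch could be as small as a wrapper around file_get_contents(), which can read HTTP URLs when allow_url_fopen is enabled. The sketch does not remember visited URLs, so a page linked from several places is fetched and written more than once.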

Create command

Create a command to start scraping via php app.php create:command scrape:

Test it

Launch application server first:

Scrape any URL via the console command (keep the server running):

To observe how many pages have been scraped, use the interactive console:

The demo solution will scan some pages multiple times; use a proper database or lock mechanism to avoid that.
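
To make the idea concrete, here is a minimal sketch of a "claim before you fetch" check. The class name VisitedUrls and its normalization rules are assumptions for illustration only. It keeps its state in memory, so it only helps inside a single PHP process; with several queue workers the same check would have to live in a shared store such as a database table with a unique key on the normalized URL.

```php
<?php

declare(strict_types=1);

/**
 * In-memory registry of URLs that were already claimed (hypothetical helper).
 */
final class VisitedUrls
{
    /** @var array<string, true> */
    private array $seen = [];

    /** Returns true the first time a URL is claimed and false on every repeat. */
    public function claim(string $url): bool
    {
        $key = self::normalize($url);
        if (isset($this->seen[$key])) {
            return false;
        }

        $this->seen[$key] = true;
        return true;
    }

    /** Reduce trivially different spellings of a URL to one key. */
    public static function normalize(string $url): string
    {
        // The fragment ("#section") never changes the fetched document.
        $url = explode('#', $url, 2)[0];

        $parts = parse_url($url);
        if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
            return $url; // not an absolute URL: use it as-is
        }

        // Scheme and host are case-insensitive; an empty path means "/".
        $normalized = strtolower($parts['scheme']) . '://' . strtolower($parts['host']);
        if (isset($parts['port'])) {
            $normalized .= ':' . $parts['port'];
        }
        $normalized .= $parts['path'] ?? '/';
        if (isset($parts['query'])) {
            $normalized .= '?' . $parts['query'];
        }

        return $normalized;
    }
}
```

A scraper would call claim() before fetching a page and skip the URL whenever it returns false.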