Goutte is built on Symfony components. Symfony is both a PHP web application framework and a set of reusable PHP components and libraries, backed by a philosophy and a community. It was created by SensioLabs, published as free software in 2005, and released under the MIT license. Google scraper is a simple PHP scraper that makes use of the Symfony web framework; it supports extracting data from the Google search results page.
We can use Spiral Queue and the RoadRunner server to implement applications that differ from the classic web setup. In this tutorial, we will implement a simple web-scraper application for CLI usage.
The scraped data will be stored in the runtime folder.
The produced code only demonstrates the capabilities and can be improved a lot.
Installing Dependencies
We will base our application on spiral/app-cli - the minimalistic spiral build without ORM, HTTP, and other extensions.
To implement all needed features we will need a set of extensions:
Extension | Comment |
---|---|
spiral/jobs | Queue support |
spiral/scaffolder | Faster scaffolding (dev only) |
spiral/prototype | Faster prototyping (dev only) |
paquettg/php-html-parser | Parsing HTML |
To install all needed packages and download app server:
Activate the installed extensions in your App\App:
Make sure to run php app.php configure to ensure proper installation.
Configure App Server
Let's configure the application server with one default in-memory queue. Create a .rr.yaml file in the root of the project:
Create Job Handler
Now, let's write a simple job handler which will scan the website, get the HTML content, and follow links until the specified depth is reached. All the content will be stored in the runtime directory.
Create a JobHandler via php app.php create:job scrape. For simplicity, we are not going to use cURL.
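The crawling logic described above can be sketched in plain PHP, independent of the framework. This is only an illustration under stated assumptions, not the handler generated by the scaffolder: the function names extractLinks and crawl are hypothetical, the sketch uses PHP's bundled DOM extension instead of paquettg/php-html-parser, and it walks an in-process list where the tutorial pushes a queue job for each link. It also keeps a per-run list of seen URLs, which the demo handler does not do.

```php
<?php
declare(strict_types=1);

/**
 * Collect absolute http(s) links from an HTML page.
 * Hypothetical helper: only absolute and root-relative hrefs are handled.
 */
function extractLinks(string $html, string $baseUrl): array
{
    if (trim($html) === '') {
        return [];
    }

    $dom = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Build "scheme://host[:port]" to resolve root-relative links.
    $base = parse_url($baseUrl);
    $origin = isset($base['scheme'], $base['host'])
        ? $base['scheme'] . '://' . $base['host'] . (isset($base['port']) ? ':' . $base['port'] : '')
        : null;

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue; // empty link or in-page anchor
        }
        if (preg_match('#^https?://#i', $href) === 1) {
            $links[] = $href;
        } elseif ($origin !== null && $href[0] === '/' && strncmp($href, '//', 2) !== 0) {
            $links[] = $origin . $href;
        }
        // Everything else (mailto:, javascript:, path-relative links) is skipped.
    }

    return array_values(array_unique($links));
}

/**
 * Breadth-first crawl limited by depth.
 * $fetch(string $url): ?string returns the page body or null on failure.
 * $store(string $url, string $html): void persists the page.
 * Returns the number of stored pages.
 */
function crawl(string $startUrl, int $maxDepth, callable $fetch, callable $store): int
{
    $queue = [[$startUrl, 0]];
    $seen = [$startUrl => true];
    $stored = 0;

    while ($queue !== []) {
        [$url, $depth] = array_shift($queue);
        $html = $fetch($url);
        if ($html === null) {
            continue;
        }
        $store($url, $html);
        $stored++;

        if ($depth >= $maxDepth) {
            continue; // depth limit reached, do not follow links further
        }
        foreach (extractLinks($html, $url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = [$link, $depth + 1];
            }
        }
    }

    return $stored;
}
```

To run it against a real site, $fetch could wrap file_get_contents (returning null when it yields false) and $store could call file_put_contents with a file name derived from the URL inside the runtime directory. Passing both as callables keeps the traversal testable without network access.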
Create command
Create a command to start scraping via php app.php create:command scrape:
Test it
Launch application server first:
Scrape any URL via the console command (keep the server running):
To observe how many pages have been scraped, use the interactive console:
The demo solution will scan some pages multiple times; use a proper database or a lock mechanism to avoid that.
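One possible lock mechanism is a marker file per URL, sketched below. The function name claimUrl and the lock directory are assumptions made for illustration. The sketch relies on fopen mode "x", which creates a file and fails if it already exists, so only the first worker to claim a URL succeeds on a local filesystem.

```php
<?php
declare(strict_types=1);

/**
 * Try to claim a URL so that only one worker scrapes it.
 * Returns true for the first caller and false for every later one.
 * Hypothetical helper; $lockDir is an assumed location such as runtime/locks.
 */
function claimUrl(string $url, string $lockDir): bool
{
    if (!is_dir($lockDir) && !@mkdir($lockDir, 0777, true) && !is_dir($lockDir)) {
        throw new RuntimeException('Unable to create lock directory: ' . $lockDir);
    }

    // Mode "x" creates the file and fails when it already exists, so the
    // existence check and the creation happen as a single step.
    $handle = @fopen($lockDir . '/' . sha1($url) . '.lock', 'x');
    if ($handle === false) {
        // Already claimed (or the file could not be created at all).
        return false;
    }

    // Keep the original URL in the marker for easier debugging.
    fwrite($handle, $url);
    fclose($handle);

    return true;
}
```

A job handler could call claimUrl at the start and return early when it yields false. Marker files are never removed here, so the lock directory has to be cleared before a fresh scrape of the same site; a database table with a unique index on the URL would serve the same purpose across several machines.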