Wiki It

Tuesday 26 February 2013

PHP Web Page Scraping in Codeigniter


Hey there guys, and welcome to another interesting post! Interesting because today I am going to show you how to scrape content from a web page.

First of all, we need to understand what web page scraping is.

Web scraping is a technique for extracting information from websites using specially coded programs!

There are three ways to access a website's data. One is through a web browser, another is through an API (if the site provides one), and the last is web scraping, which is what I am going to show you today!

Before we start scraping, we need to download one cool library, which is available for free from SourceForge.

This post shows a simple scraper built with the simplehtmldom library. Before we continue, a word of caution: if a website does not provide an RSS feed or an API for extracting information from its pages, its content is probably copyright protected. Grabbing information from such a site and using it somewhere else may well violate someone's rights and could eventually land you in trouble! So keep this in mind.

Let's get started.

INSTALLING simplehtmldom IN CODEIGNITER:

After downloading the library from SourceForge, unzip it and copy the simple_html_dom.php file into the libraries folder inside your application folder.

As you may know, you then need to open the autoload.php file in the config folder and add the following line:

$autoload['libraries'] = array('simple_html_dom');

Save the file and close it. That's it, you have successfully loaded the library into your application! Now let's create a controller file in the next step.

CREATING A SCRAPING CONTROLLER:

Copy and paste this code into a PHP class file in your controllers folder and save it.

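class Scraper extends CI_Controller
{
    public function __construct()
    {
        parent::__construct();
    }

    public function get_html_dom()
    {
        // Page to scrape; the selector below depends on its current markup.
        $url = "http://play.google.com/";

        // file_get_html() is defined in simple_html_dom.php. Depending on the
        // library version it can return false when the page cannot be loaded.
        $html = file_get_html($url);
        if (!$html)
        {
            echo "Could not load " . $url;
            return;
        }

        // find() returns an array of all elements matching the CSS selector.
        $data = $html->find('.top-list-container');

        if (isset($data[0]))
        {
            // children(1) is the second child, because indexes start at 0.
            echo $data[0]->children(1)->children(1);
        }
    }
}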


Let's go through the method now.

If you take a look, you will see that I am scraping content from the Google Play Store!! :P

The id and class selectors may change in the future whenever Google changes its page templates, so you will need to update your program to match!

file_get_html( ) is a helper function defined in simple_html_dom.php, and I am passing it the Play Store URL.

Now we need to traverse the DOM using the find( ) function as shown above. I am looking for elements with class = top-list-container in the HTML page, and then I am echoing out a child tag nested inside the first matching container! Refresh your browser and you will see how the library fetches the content from the web page!

This is a simple example. You can also use a foreach loop to go through more than one container in the HTML page, and so on!
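
To make the foreach idea a bit more concrete, here is a minimal sketch. Please note that it does not use simplehtmldom: so that it runs with nothing to download, it uses PHP's built-in DOMDocument and DOMXPath classes instead, and it parses a small made-up HTML string rather than the live Play Store page. The function name scrape_container_titles, the sample markup and the h2 headings are all my own assumptions for illustration, so treat it as a rough starting point and not as the way the library works.

```php
<?php
// Sample markup standing in for a fetched page (hypothetical, not real Play Store HTML).
$sampleHtml = <<<HTML
<html><body>
  <div class="top-list-container"><h2>Top Free</h2><ul><li>App A</li><li>App B</li></ul></div>
  <div class="top-list-container featured"><h2>Top Paid</h2><ul><li>App C</li></ul></div>
  <div class="footer"><h2>About</h2></div>
</body></html>
HTML;

/**
 * Return the heading text of every element carrying the given class.
 * Hypothetical helper built on PHP's DOM extension, not on simplehtmldom.
 * $className is assumed to be a plain, trusted class name.
 */
function scrape_container_titles($html, $className)
{
    $dom = new DOMDocument();
    // Real pages are rarely valid markup, so collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Match the class as a whole word inside the space-separated class attribute.
    $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $className ')]";

    $titles = array();
    foreach ($xpath->query($query) as $container) {
        // Take the first <h2> in each container and skip containers without one.
        $heading = $container->getElementsByTagName('h2')->item(0);
        if ($heading !== null) {
            $titles[] = trim($heading->textContent);
        }
    }
    return $titles;
}

// Loop over every matching container instead of only the first one.
$titles = scrape_container_titles($sampleHtml, 'top-list-container');
foreach ($titles as $title) {
    echo $title . "\n";
}
```

With simplehtmldom the loop has the same shape, because find( ) hands back an array of matching elements that foreach can walk through one by one.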

I hope you liked this post! Thank you, and see you next time.


4 comments:

  1. Nice blog,thanks for sharing the nice information and i read the full blog and i have to sure bookmark this blog for the future use.

    website scraping

  2. how to scrape data which have post method? using simplehtmldom and CI framework

  3. Hello sir when i go to sourceforge the file is not available my email id is rajatvarshney2009@gmail.com can you please send me the library file thanks
