Mar 13, 2010

Since everyone else appears to be at SXSW, I suppose I’ll have to step up for today’s March Madness. I bring you: a screen scraper for retail products on ecommerce sites.

While I’m hardly the first person to write such a tool, finding useful examples or libraries among the hundreds of pages of screen scraper spam has proven difficult. I ended up writing one from scratch in PHP using the DOMDocument class.
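For anyone who hasn't used DOMDocument before, here is a minimal sketch of the basic pattern: parse some markup, read the title, and walk the img elements. The sample HTML string and the variable names are made up for illustration, and it parses a string rather than fetching a URL so that it runs without a network connection.

```php
<?php
// Hypothetical sample markup standing in for a fetched product page.
$html = '<html><head><title>Blue Widget</title></head><body>'
      . '<img src="/img/widget.jpg" width="400" height="400">'
      . '<img src="/img/spacer.gif" style="display: none">'
      . '</body></html>';

$dom = new DOMDocument();
// Real-world pages are rarely valid HTML, so collect parser warnings
// internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// The <title> element may be absent, so guard against a null node.
$titleNode = $dom->getElementsByTagName('title')->item(0);
$title = $titleNode ? trim($titleNode->nodeValue) : '';

// Collect the src attribute of every <img> element.
$sources = array();
foreach ($dom->getElementsByTagName('img') as $img) {
    $sources[] = $img->getAttribute('src');
}

echo $title, "\n";
echo implode("\n", $sources), "\n";
```

For a live page, `loadHTMLFile()` with a URL takes the place of `loadHTML()`, provided PHP is configured to allow opening URLs.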

The goal of the scraper is to come up with the product title, the price, and the three most likely product photos from any given product URL. To make it a bit faster (it’s pretty painfully slow), I attempt to filter out images which are obviously not product photos: those which are very wide or very tall, and those which are not displayed in the browser. Then, for a bit of extra fun, it sorts the image array by each image’s “likeliness” to be a product photo. Obviously it needs some refining to actually be useful.
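The filtering and ranking idea can be sketched on its own, without any network access, as a scoring function over image dimensions. This is a simplified illustration and not the scraper itself: the `MIN_WIDTH` and `MIN_HEIGHT` thresholds, the function names, and the sample data are all assumptions, and `usort()` is used here for the ordering.

```php
<?php
// Hypothetical thresholds: the smallest image worth considering, in pixels.
const MIN_WIDTH  = 150;
const MIN_HEIGHT = 150;

/**
 * Score one image from its pixel dimensions.
 * Returns null when the image should be discarded, otherwise a
 * likeliness score where higher means more photo-like.
 */
function scoreImage($width, $height)
{
    if ($width < MIN_WIDTH || $height < MIN_HEIGHT) {
        return null;            // too small to be a product photo
    }
    $ratio = $width / $height;  // safe: $height >= MIN_HEIGHT > 0
    if ($ratio > 3 || $ratio < 0.33) {
        return null;            // banner-shaped or skyscraper-shaped
    }
    // Moderately stretched images stay in, but at half weight.
    return ($ratio >= 2 || $ratio <= 0.5) ? 0.5 : 1.0;
}

/** Keep the scorable candidates and return the three most likely. */
function rankImages(array $candidates)
{
    $scored = array();
    foreach ($candidates as $c) {
        $score = scoreImage($c['width'], $c['height']);
        if ($score !== null) {
            $scored[] = array('url' => $c['url'], 'likely' => $score);
        }
    }
    // Most likely first.
    usort($scored, function ($a, $b) {
        if ($a['likely'] == $b['likely']) {
            return 0;
        }
        return ($a['likely'] < $b['likely']) ? 1 : -1;
    });
    return array_slice($scored, 0, 3);
}

// Made-up candidates: a banner, a wide image, an icon and a square photo.
$top = rankImages(array(
    array('url' => 'banner.png', 'width' => 900, 'height' => 100),
    array('url' => 'wide.jpg',   'width' => 600, 'height' => 300),
    array('url' => 'icon.gif',   'width' => 16,  'height' => 16),
    array('url' => 'photo.jpg',  'width' => 500, 'height' => 500),
));
foreach ($top as $t) {
    echo $t['url'], ' ', $t['likely'], "\n";
}
```

Keeping the scoring in a pure function makes it easy to tune the thresholds against a handful of known pages before paying the cost of downloading real images.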

<?php
$time1 = microtime(true);

//$link (the product URL), $width and $height (the minimum image size in
//pixels) and the rel2abs() helper (resolves a relative URL against $link)
//are defined in a part of the script that is missing from this listing

$dom = new DOMDocument();
//@ suppresses the warnings that malformed real-world markup triggers
if (@$dom->loadHTMLFile($link)) {
    //get the title (the page may not have one)
    $titleNode = $dom->getElementsByTagName('title')->item(0);
    $title = $titleNode ? $titleNode->nodeValue : '';

    //get the images
    $images = $dom->getElementsByTagName('img');

    //get the images most likely to be product photos
    $thumbs = array();
    foreach ($images as $img) {
        $skip = FALSE;
        $likely = 1;

        //get the style on the image
        $style = $img->getAttribute('style');

        //if the style is set to not display, it's not worth checking
        if (preg_match('/display\s*:\s*none/i', $style)) $skip = TRUE;

        //attempt to get the height and width if they are set
        $img_height = $img->getAttribute('height');
        $img_width = $img->getAttribute('width');
        if (is_numeric($img_height) && $img_height < $height) {
            $skip = TRUE;
        } else if (is_numeric($img_width) && $img_width < $width) {
            $skip = TRUE;
        } else if (is_numeric($img_height) && is_numeric($img_width) && $img_height > 0) {
            //very wide or very tall images are not product photos
            $ratio = $img_width / $img_height;
            if ($ratio > 3 || $ratio < 0.33) $skip = TRUE;
        }

        //if it's not already thrown out
        if ($skip === FALSE) {
            if (
                ($url = rel2abs($img->getAttribute('src'), $link)) &&
                ($i = getimagesize($url)) &&
                $i[0] >= ($width-10) &&
                $i[1] >= ($height-10) &&
                $i[1] > 0
            ) {
                //if the aspect ratio is 2:1 or more (either way), it's unlikely that it's a product image
                if ($i[0]/$i[1] >= 2 || $i[0]/$i[1] <= 0.5) {
                    $likely = $likely*0.5;
                }
                $thumbs[] = array('url'=>$url, 'likely'=>$likely);
            }
        }
    }

    //sort the array by likeliness, most likely first
    $likeliness = array();
    foreach ($thumbs as $key => $row) {
        $likeliness[$key] = $row['likely'];
    }
    array_multisort($likeliness, SORT_DESC, $thumbs);

    //gross hack to try to find price
    $xmlstring = $dom->saveHTML();
    //(the preg_match_all() call that fills $matches from $xmlstring is missing from this listing)
    $price = $matches[0][0];

    //output to browser
    //(the HTML markup inside the echo strings below is missing from this listing)
    echo "\n\n\n";
    echo "\n\n\n";
    foreach ($thumbs as $thumb) {
        $src = $thumb['url'];
        echo " ";
    }
}
$time2 = microtime(true);
$diff = $time2-$time1;
echo "Script executed in $diff seconds";
?>
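Since the price lookup is the weakest part, here is one possible way to approach it, offered as a rough sketch and not as the pattern the script above uses. It assumes prices are written with a leading $, £ or € symbol and takes the first match; `findPrice()` and the sample strings are hypothetical.

```php
<?php
/**
 * Return the first substring of $text that looks like a price, or null.
 * Assumption: prices carry a leading $, £ or € symbol, use commas as
 * thousands separators and a dot before two decimal digits.
 */
function findPrice($text)
{
    // The /u modifier makes the multi-byte currency symbols match as
    // whole characters inside the character class.
    $pattern = '/[$£€]\s?\d+(?:,\d{3})*(?:\.\d{2})?/u';
    if (preg_match($pattern, $text, $m)) {
        return $m[0];
    }
    return null;
}

// Made-up sample strings.
var_dump(findPrice('Now only $1,299.00 with free shipping'));
var_dump(findPrice('Prix: € 45.50'));
var_dump(findPrice('No price on this page'));
```

In the scraper, the input could be the string returned by `$dom->saveHTML()`. Taking the first match is fragile: a page that shows a crossed-out list price, or prices for related products, ahead of the real one will give the wrong answer, so this still counts as a hack.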

  2 Responses to “March Madness: Product Info Screen Scraping!”

  1. You should try the Beta2 version of ScrapePro Web Scraper Designer application for free:

  2. Did you have any thoughts about a nicer way to get price information?
