March Madness: Product Info Screen Scraping!
Since everyone else appears to be at SXSW, I suppose I’ll have to step up for today’s March Madness. I bring you: a screen scraper for retail products on ecommerce sites.
While I’m hardly the first person to write such a tool, finding useful examples or libraries among the hundreds of pages of screen scraper spam has proven difficult. I ended up writing one from scratch in PHP using the DOMDocument class.
The goal of the scraper is to come up with the product title, price, and the three most likely product photos for any given product URL. To make it a bit faster (it’s painfully slow, since it has to fetch every candidate image to measure it), I try to filter out images that are obviously not product photos: those that are very wide or very tall, and those hidden with display:none. Then, for a bit of extra fun, it sorts the image array by each image’s “likeliness” of being a product photo. Obviously it needs some refining to actually be useful.
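The filtering and ranking idea can be tried without touching the network. What follows is only a minimal sketch, not the script itself: `rank_candidates()`, its input array keys, and the 100×100 minimum size are names and values made up for the example, and the thresholds simply mirror the ones described above.

```php
<?php
// Hypothetical helper for illustration only: rank_candidates() and its input
// format are assumptions, not part of the original script.
// Each candidate: array('url' => string, 'width' => int, 'height' => int, 'style' => string)
function rank_candidates(array $candidates, $minWidth = 100, $minHeight = 100, $limit = 3)
{
    $kept = array();
    foreach ($candidates as $c) {
        // hidden images cannot be the product photo
        if (preg_match('/display:\s*none/i', $c['style'])) {
            continue;
        }
        // too small to be a product photo (the <= 0 check guards the division below)
        if ($c['height'] <= 0 || $c['width'] < $minWidth || $c['height'] < $minHeight) {
            continue;
        }
        $ratio = $c['width'] / $c['height'];
        // banners and spacers: roughly more than 3:1 in either direction
        if ($ratio > 3 || $ratio < 0.33) {
            continue;
        }
        // 2:1 or more extreme is allowed, but scored as less likely
        $likely = ($ratio >= 2 || $ratio <= 0.5) ? 0.5 : 1.0;
        $kept[] = array('url' => $c['url'], 'likely' => $likely);
    }
    // most likely first
    usort($kept, function ($a, $b) {
        return $b['likely'] <=> $a['likely'];
    });
    return array_slice($kept, 0, $limit);
}

$ranked = rank_candidates(array(
    array('url' => 'banner.png', 'width' => 900, 'height' => 120, 'style' => ''),              // dropped: 7.5:1
    array('url' => 'wide.jpg',   'width' => 500, 'height' => 220, 'style' => ''),              // kept, scored 0.5
    array('url' => 'hidden.jpg', 'width' => 400, 'height' => 400, 'style' => 'display: none'), // dropped: hidden
    array('url' => 'shoe.jpg',   'width' => 400, 'height' => 400, 'style' => ''),              // kept, scored 1.0
    array('url' => 'icon.gif',   'width' => 16,  'height' => 16,  'style' => ''),              // dropped: too small
));
foreach ($ranked as $r) {
    echo $r['url'], ' ', $r['likely'], "\n";
}
```

With this input, `shoe.jpg` ranks ahead of `wide.jpg` and the other three are thrown out. Keeping the scoring in a function that takes plain arrays makes it easy to tweak the thresholds without re-fetching a page every time.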
```php
<?php
// NOTE: the opening lines of this listing were lost when the post was published.
// The setup down to the loadHTMLFile() call is reconstructed, and the values of
// $link, $width and $height are placeholders, not the original ones.
$time1  = microtime(true);
$link   = 'http://www.example.com/product'; // product page URL (placeholder)
$width  = 100; // minimum image width in pixels (placeholder)
$height = 100; // minimum image height in pixels (placeholder)
$thumbs = array();
$price  = '';

// rel2abs($relative, $base) is a helper that is not part of this listing;
// it is expected to resolve an image src against the page URL.

$dom = new DOMDocument();
// the @ hides the warnings libxml emits for sloppy real-world HTML
if (@$dom->loadHTMLFile($link)) {
    //get the title
    $titles = $dom->getElementsByTagName('title');
    $title = $titles->length ? trim($titles->item(0)->nodeValue) : '';

    //get the images most likely to be product photos
    foreach ($dom->getElementsByTagName('img') as $img) {
        $skip = FALSE;
        $likely = 1;

        //if the style is set to not display, it's not worth checking
        if (preg_match('/display:\s*none/i', $img->getAttribute('style'))) $skip = TRUE;

        //attempt to get the height and width if they are set
        $img_height = $img->getAttribute('height');
        $img_width = $img->getAttribute('width');
        if (is_numeric($img_height) && $img_height < $height) {
            $skip = TRUE;
        } else if (is_numeric($img_width) && $img_width < $width) {
            $skip = TRUE;
        } else if (is_numeric($img_height) && is_numeric($img_width)) {
            //very wide or very tall images are banners or spacers
            if (($img_width/$img_height) > 3 || ($img_width/$img_height) < 0.33) $skip = TRUE;
        }

        //if it's not already thrown out, fetch it and check its real size
        if ($skip === FALSE) {
            if (
                ($url = rel2abs($img->getAttribute('src'), $link)) &&
                ($i = @getimagesize($url)) &&
                $i[0] >= ($width-10) &&
                $i[1] >= ($height-10)
            ) {
                //if the aspect ratio is 2:1 or more extreme, it's unlikely that it's a product image
                if ($i[0]/$i[1] >= 2 || $i[0]/$i[1] <= 0.5) {
                    $likely = $likely*0.5;
                }
                $thumbs[] = array('url'=>$url, 'likely'=>$likely);
            }
        }
    }

    //sort the array by likeliness, most likely first (once, after the loop)
    if ($thumbs) {
        $likeliness = array();
        foreach ($thumbs as $key => $row) {
            $likeliness[$key] = $row['likely'];
        }
        array_multisort($likeliness, SORT_DESC, $thumbs);
        //keep the three most likely photos
        $thumbs = array_slice($thumbs, 0, 3);
    }

    //gross hack to try to find price: the first thing that looks like a dollar amount
    $xmlstring = $dom->saveHTML();
    if (preg_match('/\$\d[\d,]*(?:\.\d+)?/', $xmlstring, $matches)) {
        $price = $matches[0];
    }

    //output to browser
    echo "<h1>" . htmlspecialchars($title) . "</h1>\n";
    echo "<h2>" . htmlspecialchars($price) . "</h2>\n";
    foreach ($thumbs as $thumb) {
        $src = htmlspecialchars($thumb['url']);
        echo "<img src=\"$src\" alt=\"\" />\n";
    }
}
$time2 = microtime(true);
$diff = $time2-$time1;
echo "Script executed in $diff seconds";
?>
```
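The listing calls `rel2abs()` to turn an image `src` into an absolute URL, but the helper itself is not shown. Below is a simplified stand-in, written as a sketch and not the helper the script originally used. It assumes the base is a full `http://` or `https://` page URL and covers absolute, protocol-relative, root-relative, and path-relative values; it does not try to follow every rule for resolving relative references (a `src` that is only a query string or fragment, for instance, is not handled properly).

```php
<?php
// Simplified stand-in for the missing rel2abs() helper (an assumption, not the
// original). Returns an absolute URL, or false when nothing usable is given.
function rel2abs($rel, $base)
{
    $rel = trim($rel);
    if ($rel === '') {
        return false;
    }
    // already absolute: starts with a scheme such as http: or https:
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $rel)) {
        return $rel;
    }
    $parts = parse_url($base);
    if ($parts === false || empty($parts['scheme']) || empty($parts['host'])) {
        return false;
    }
    // protocol-relative, e.g. //cdn.example.com/a.jpg
    if (substr($rel, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $rel;
    }
    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');
    if ($rel[0] === '/') {
        $path = $rel;
    } else {
        // directory of the base page, e.g. /shop/item.html -> /shop/
        $basePath = isset($parts['path']) ? $parts['path'] : '/';
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $rel;
    }
    // keep any query string or fragment out of the path clean-up
    $suffix = '';
    if (preg_match('/[?#].*$/', $path, $m)) {
        $suffix = $m[0];
        $path = substr($path, 0, -strlen($suffix));
    }
    // collapse "." and ".." segments without climbing above the root
    $out = array();
    foreach (explode('/', $path) as $seg) {
        if ($seg === '.') {
            continue;
        }
        if ($seg === '..') {
            if (count($out) > 1) {
                array_pop($out);
            }
            continue;
        }
        $out[] = $seg;
    }
    return $root . implode('/', $out) . $suffix;
}

echo rel2abs('../img/a.jpg', 'http://www.example.com/shop/item.html'), "\n";
```

Returning `false` for an empty or unusable `src` fits the way the listing uses the result inside an `&&` chain: a bad URL simply means the image is skipped.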