Hey guys,
I need a little help scraping the No Frills website. The main problem I have is sending headers or cookies to set a store. If you've never been to the website, the first time you visit it asks you to select a province, city, and store. After that I have access to the items and prices for that store. I've tried various approaches with cURL, but I get a "Received HTTP code 403 from proxy after CONNECT" error.
Here is the link: http://www.nofrills.ca/LCLOnline/flyers_landing_page.jsp - you can select any province, city and store for testing.
Please help me. Thank you in advance,
- ultimatum

Screen-scraping is often a breach of the terms of use of a particular website. This makes it an activity that I'm cautious about supporting in detail. If you have genuine reasons for doing this, you'll find Fiddler absolutely invaluable in analysing the "conversation" between your browser and the web server. Having examined this, it is easier to replicate the web transaction through cURL.
Cookie management is documented in the PHP manual. Have a look in particular at the cURL options array. Here's an example of a cURL constructor I use in one of my projects:
Code:
/**
 * Default Constructor
 * Set up cURL session with cookies
 */
function __construct($cookie_serial = '') {
    $this->_ch = curl_init();
    curl_setopt($this->_ch, CURLOPT_POST, 1);
    curl_setopt($this->_ch, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($this->_ch, CURLOPT_COOKIEJAR, '/tmp/cookies/cookie_' . $cookie_serial . '.txt');
    curl_setopt($this->_ch, CURLOPT_COOKIEFILE, '/tmp/cookies/cookie_' . $cookie_serial . '.txt');
    curl_setopt($this->_ch, CURLOPT_HEADER, 1);
    curl_setopt($this->_ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($this->_ch, CURLOPT_TIMEOUT, 60);
    curl_setopt($this->_ch, CURLOPT_CONNECTTIMEOUT, 60);
    $this->_html = '';
}
Note the use of a parameter for the cookie jar, to avoid collision between multiple cURL instances.
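To show how that handle gets used, here's a minimal sketch of a request method - the fetch() name and the $_ch/$_html properties are just assumptions based on the constructor above, not the actual project code. Because CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE point at the same file, any cookies set by an earlier response are sent back automatically on the next call:

Code:
/**
 * Illustrative only: POST $fields to $url on the shared handle.
 * The cookie jar set up in the constructor means cookies from
 * earlier responses are sent back automatically.
 */
function fetch($url, $fields = array()) {
    curl_setopt($this->_ch, CURLOPT_URL, $url);
    curl_setopt($this->_ch, CURLOPT_POSTFIELDS, http_build_query($fields));
    $this->_html = curl_exec($this->_ch);
    if ($this->_html === false) {
        // Surface cURL errors (e.g. the 403-from-proxy problem) rather than failing silently
        throw new Exception(curl_error($this->_ch));
    }
    return $this->_html;
}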
Quote from: Rob Pomeroy on March 26, 2012, 06:39:57 AM
you'll find Fiddler absolutely invaluable in analysing the "conversation" between your browser and the web server
Thanks for that! See, the problem is that you have to manually make a selection before the POST variables are sent, and those variables are processed through JavaScript. I understand and have used cURL + cookies for logins and such, but here the script needs to save cookies before it will send them back on another page.
Thank you for the tips and quick response (for sure), but I found a solution to this problem.
- ultimatum
Quote from: ultimatum on March 26, 2012, 08:03:49 PM
See the problem is that you have to manually make a selection
No - that's what the "scraping" bit is about. Pull in the first web page and use a regular expression to extract the variables from the SELECT/OPTION form controls. Then you do your post.
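As a rough sketch only (the 'store' field name and the POST target below are guesses - watch the real request in Fiddler to get the actual parameter names and action URL), the whole round trip might look something like this:

Code:
// Sketch only: the 'store' field name and the POST target are guesses;
// confirm the real parameter names and action URL in Fiddler first.
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/nofrills_cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/nofrills_cookies.txt');

// 1. Pull in the landing page.
curl_setopt($ch, CURLOPT_URL, 'http://www.nofrills.ca/LCLOnline/flyers_landing_page.jsp');
$page = curl_exec($ch);

// 2. Extract the OPTION values with a regular expression (in practice,
//    narrow this down to the SELECT block you actually want first).
preg_match_all('/<option[^>]+value="([^"]*)"/i', $page, $matches);
$values = $matches[1];

// 3. POST a chosen value back, re-using the same cookie jar so the
//    session survives into the next request.
curl_setopt($ch, CURLOPT_URL, 'http://www.nofrills.ca/LCLOnline/flyers_landing_page.jsp');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array('store' => $values[0])));
$storePage = curl_exec($ch);

curl_close($ch);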