
Farming Webpages: any suggestions for data acquisition?


I have been farming web sites for information through automated macros which dump the copy/pasted data into a database.

I am checking to see if anyone knows of an easier way to farm web sites for information than having to create a mouse/keyboard macro for each site I want to pull data from for analysis.

Maybe there is a piece of software that interfaces with the HTML source of the web page itself and copies information that follows a specified flag. For example, if a web site states that it is 83 degrees out, its source will contain a field showing 83 degrees, and the software could copy that value directly from the HTML source.

Just want to also mention that I have NO need to capture keystrokes or screenshots, which could be used maliciously. I simply want an easier way to interface with static information these sites provide openly to the public, and have no need for dynamic data such as data entered by a user, which could be used for the wrong means.

Any suggestions greatly appreciated.

Hi Dave. I do this kind of stuff from time to time. I tend to use PHP, partly because I'm very familiar with it and partly because it's very good at processing text. Here's the general process:

  • if the web site requires authentication, programmatically send credentials by whatever means it normally requires (HTTP authentication, POST variables, etc) and allow cookies to be set
  • load page into variable
  • parse page using regular expression pattern matching
  • dump extracted data into database, text file, whatever

And here's an example:

Code:
<?php
/****************************************
 * BEGIN: Configure cURL                *
 ****************************************/
$ch = curl_init();
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_COOKIEJAR, dirname(__FILE__).'/cookie.txt');
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
/****************************************
 * END: Configure cURL                  *
 ****************************************/

$post = 'username=username&password=password';
curl_setopt($ch, CURLOPT_URL, 'http://www.example.com/login');
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);

if ($page = curl_exec($ch))
{
    // That was the login; now to retrieve the pages:
    $regexp = '|insert regexp here with (brackets around text we want to save)|isU';

    // Page to parse
    $url = "http://www.example.com/start";

    // Load the page (switch back to GET so the login fields aren't re-posted)
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    $page = curl_exec($ch);

    // Find the desired text
    if (preg_match_all($regexp, $page, $result))
    {
        // do something with the matches
    }

    // Save the page
    if (file_put_contents("/somewhere/file.txt", $page))
    {
        echo "succeeded<br/>";
    } else {
        echo "FAILED<br/>";
    }

} else {
    echo "Login failed";
}
curl_close($ch);
?>

Thanks Rob!


I am going to try this out.

Cool. The hardest part is getting the regular expression right. Let me know if you need any more help.

Out of interest, this technique is usually called "web scraping" or sometimes (inaccurately) "screen scraping".
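To make the regex step concrete, here's a minimal sketch of just the matching part on its own. The HTML snippet, the current-temp id, and the pattern are all made up to mirror the "83 degrees" example from your first post; a real page would need its own pattern:

```php
<?php
// Hypothetical HTML, standing in for a downloaded page (normally $page = curl_exec($ch))
$page = '<html><body>Now: <span id="current-temp">83 degrees</span></body></html>';

// The brackets (parentheses) around \d+ capture just the number;
// the U modifier keeps the match ungreedy, as in the example above
$regexp = '|<span id="current-temp">(\d+) degrees</span>|isU';

if (preg_match_all($regexp, $page, $result))
{
    // $result[0] holds the full matches, $result[1] the captured groups
    echo $result[1][0]; // prints 83
}
?>
```

From there, $result[1][0] can go straight into a database insert or a file_put_contents() call.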

