Blake

Web Scraping/Grabbing Data


I need to write a script to grab a bunch of data off of Walmart. I need to look at a bunch of different products, get information about them, and save it to a text file. The information I want is stored in <meta> tags. This is how I plan to do it...

 

Start with a URL like http://www.walmart.com/ip/10311410

Save the HTML source

Search the HTML source for <meta tags, and save the values I want

Increment the number in the URL above

 

In the end, I'd like a file with values like this -

 

Cheerios | 3.00 | Kellogs | other data

 

I'm not really sure where to start on this. I'd like to do it in C# on .NET, in Visual Studio 2012. Any tips on where to start or what to do?



I would look into things such as "C# or VB.NET Web Crawler" and "Regular expressions".

 

The crawler will show you how to actually visit and download webpages and the regular expressions would help you find data between tags.
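A minimal sketch of the two pieces together, using WebClient for the download and a simple regex for the tags. The itemprop/content attribute layout (and its order) is an assumption about how the product pages are marked up, not something guaranteed:

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

class MetaScraperSketch
{
    static void Main()
    {
        // Download the raw HTML of one product page.
        using (var client = new WebClient())
        {
            string html = client.DownloadString("http://www.walmart.com/ip/10311410");

            // Naive pattern: <meta ... itemprop="X" ... content="Y" ...>
            // Assumes this attribute order; real pages may differ.
            var pattern = new Regex("<meta[^>]*itemprop=\"([^\"]+)\"[^>]*content=\"([^\"]+)\"");
            foreach (Match m in pattern.Matches(html))
                Console.WriteLine(m.Groups[1].Value + " | " + m.Groups[2].Value);
        }
    }
}
```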



Thanks for the tips :) So far, I wrote this quickly as a web page, but when it runs, it returns nothing. Just trying to get the stream (HTML of a website) to show up on a webpage, so that I know it fetches it correctly.

 

 

string urlfinal = "http://www.google.com";

WebRequest request = WebRequest.Create(urlfinal);
StreamReader stream = new StreamReader(request.GetResponse().GetResponseStream());
StringBuilder builder = new StringBuilder();
stream.Close();

string answer = builder.ToString();

Label ans = new Label();
ans.Text = answer;
addhere.Controls.Add(ans);

 

 

After I get the stream to work, writing a loop to change the value of the number at the end of the url, parsing the stream to get the values I want, and saving them to a file shouldn't be that hard (I hope). I'd like to have this done by tomorrow night if I can!
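For reference, the snippet above comes back empty because nothing is ever copied from the stream into the StringBuilder before the reader is closed. A corrected sketch of the same fetch (`addhere` is the existing page control from the original code; needs System.IO and System.Net):

```csharp
string urlfinal = "http://www.google.com";

WebRequest request = WebRequest.Create(urlfinal);
using (StreamReader stream = new StreamReader(request.GetResponse().GetResponseStream()))
{
    // Read the whole response before the reader is closed --
    // the original closed the stream without ever reading it.
    string answer = stream.ReadToEnd();

    Label ans = new Label();
    ans.Text = answer;
    addhere.Controls.Add(ans);
}
```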



 

While regex can have issues with complex HTML, for what Blake is doing it will be fine, and it doesn't require any external libraries. If he were writing a generic crawler for arbitrary pages, regex would be a bad choice, but since he is only scraping Walmart product pages, which all share the same simple format, he will be fine.



If they "all have the same format and everything will be the same," then he might as well use a simple XDocument. Regex is almost never an optimal solution for parsing HTML.



This is what I have to get the HTML - it works now; I tested it by displaying it in a textbox.

string urlfinal = "http://www.google.com";
 WebClient client = new WebClient();
 string answer = client.DownloadString(urlfinal);

 

This works fine with google.com, but when I try the page I want - http://www.walmart.com/ip/10311410 - it gives this error... any idea why/how to fix it?

[error screenshot: 9Xs3N.png]

 

I've never done regexes, so I guess it's no problem not to use them. I've downloaded the HTML Agility Pack, but I don't really know how to install/use it, or the code to actually make it work. After I can get the parsing done, I think it will be simple to set up a couple of loops and save the results to a file... any tips?



Got the Agility Pack installed and working correctly (I think)

 

Posting a couple thoughts/notes before I leave for an hour or two (haven't tested them yet) -

 

When using the HTML Agility Pack, do I have to have a .htm file, or can I search through a string with it? If not, will I have to get the HTML, save it as a .htm file, then load the file and search it with the Agility Pack?

 

Will this work alright? - http://stackoverflow.com/questions/3231841/getting-meta-tag-attribute-with-html-agility-pack-using-xpath
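For what it's worth, the Agility Pack can parse straight from a string with HtmlDocument.LoadHtml, so no temp file is required. A quick sketch (the itemprop XPath is an assumption based on the kind of meta tags being targeted):

```csharp
using System.Net;
using HtmlAgilityPack;

// Download and parse entirely in memory; no .htm file on disk needed.
var client = new WebClient();
string html = client.DownloadString("http://www.walmart.com/ip/10311410");

var doc = new HtmlDocument();
doc.LoadHtml(html);   // parses the string directly

// Approach from the linked Stack Overflow answer, adapted to itemprop metas.
var nodes = doc.DocumentNode.SelectNodes("//meta[@itemprop]");
```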



Making progress, still two major roadblocks though. Also, triple post :P First, I'll post my code...

string urlfinal = "http://www.google.com";
WebClient client = new WebClient();
string answer = client.DownloadString(urlfinal);

// Save the HTML to disk, then reload it with the Agility Pack.
TextWriter tw = new StreamWriter(@"C:\Documents and Settings\thic009\Desktop\htm.htm");
tw.WriteLine(answer);
tw.Close();

HtmlDocument htmldoc = loadhtmfile();
List<string> metaTags = extractmetatags(htmldoc);

// metas is a GridView
metas.DataSource = metaTags;
metas.DataBind();

TextBox ans = new TextBox();
ans.TextMode = TextBoxMode.MultiLine;
ans.Columns = 140;
ans.Rows = 80;
ans.Text = answer + "done";
addhere.Controls.Add(ans);
}

private List<string> extractmetatags(HtmlDocument doc)
{
    List<string> metatags = new List<string>();
    foreach (HtmlNode meta in doc.DocumentNode.SelectNodes("//a[@href]"))
    {
        HtmlAttribute att = meta.Attributes["href"];
        metatags.Add(att.Value);
    }
    return metatags;
}

private HtmlDocument loadhtmfile()
{
    TextReader reader = File.OpenText(@"C:\Documents and Settings\thic009\Desktop\htm.htm");
    HtmlDocument doc = new HtmlDocument();
    doc.Load(reader);
    reader.Close();
    return doc;
}

 

First major problem - the google.com site works fine with this, but the one I want, the Walmart one, doesn't. This is still the same problem as in the post before (I updated the picture, though). Is Walmart somehow stopping me from saving the HTML? That's the line where I get the error.

 

Second - the code above searches through the HTML and gets the values of all the href attributes. At first I tried it with meta tags, but it threw an error ("Object reference not set to an instance of an object," I think). With href, it worked fine. I displayed the results in a GridView; this is what the output looks like.

[output screenshot: U6W1A.png]

 

 

Any ideas on how to fix those? Also, am I heading in the right direction with this problem? I don't know a lot about the different libraries or ways of doing things, so I want to make sure I'm on the right track.
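On the second problem: SelectNodes returns null (not an empty collection) when the XPath matches nothing, which is exactly what produces that NullReferenceException in the foreach. A sketch of a meta-tag version with a null check (the itemprop/content attribute names are assumed from the Walmart markup):

```csharp
private List<string> extractmetatags(HtmlDocument doc)
{
    var metatags = new List<string>();

    // SelectNodes returns null, not an empty list, when nothing matches.
    var nodes = doc.DocumentNode.SelectNodes("//meta[@itemprop and @content]");
    if (nodes != null)
    {
        foreach (HtmlNode meta in nodes)
        {
            metatags.Add(meta.Attributes["itemprop"].Value + " | " +
                         meta.Attributes["content"].Value);
        }
    }
    return metatags;
}
```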



If you wanted to make this really easy, you could use this PHP script:

 

<?php
$meta = get_meta_tags("http://www.walmart.com/ip/" . $_GET['id']);
$meta = json_encode($meta);
echo($meta);
?>

{"title":"Cheerios Toasted Whole Grain Oat Cereal, 18 oz","description":"Cheerios Toasted Whole Grain Oat Cereal - Walmart.com","keywords":"Shop for the Cheerios Toasted Whole Grain Oat Cereal that is available at an every day low price from Walmart.com.","verify-v1":"4kvsU2tt\/Ic9Z39fTPz2rJhJCUvOnYMvHF1z8ypaSvo="}

 

Then download that and parse it using one of the many JSON parsers.

Obviously it'd be slower, but it would solve a lot of your problems.
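On the C# side, the returned JSON is flat key/value pairs, so the built-in JavaScriptSerializer can read it straight into a dictionary. A sketch - the example.com URL is a placeholder for wherever the PHP script ends up hosted:

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Web.Script.Serialization;  // reference System.Web.Extensions

class JsonFetchSketch
{
    static void Main()
    {
        var client = new WebClient();

        // Placeholder host -- substitute wherever the PHP script is deployed.
        string json = client.DownloadString("http://example.com/?id=10311410");

        var meta = new JavaScriptSerializer()
                       .Deserialize<Dictionary<string, string>>(json);
        Console.WriteLine(meta["title"]);
    }
}
```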


I don't know enough about the HTML Agility Pack to answer your questions specifically. I can't speak to why that Walmart page is rejecting your request... it may expect certain request headers that aren't being sent. Will it work with PHP cURL?



That's awesome! I didn't know what a JSON parser was, but I googled it and found an online one. I've never used PHP before, and I'm new to building applications and to different IDEs... so I'm not sure how/where to use the PHP code, how to put it in a loop, or how to run it. If I could get those working, and save a file full of what's in the last line of your code block, that's exactly what I need.

 

EDIT: In the URL part of your code, how does $_GET['id'] lead you to the Cheerios page, or return 10311410 (the end of the URL for the Cheerios)?

 

 

EDIT2: Also, in the 5.2k lines of code for the Walmart page, there are three blocks of meta tags. I need the information in the second one, i.e. the

<meta itemprop="image" content="http://i.walmartimages.com/i/p/00/01/60/00/27/0001600027528_300X300.jpg"/>
<meta itemprop="brand" itemtype="http://schema.org/Organization" content="Cheerios"/>
<meta itemprop="manufacturer" itemtype="http://schema.org/Organization" content="General Mills"/>
<meta itemprop="model" content="27528"/>
<meta itemprop="productID" content="016000275287"/>
<div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
<meta itemprop="availability" itemtype="http://schema.org/ItemAvailability" content="http://schema.org/InStock"/>
<meta itemprop="price" content="3.68"/>
<meta itemprop="priceCurrency" content="USD"/>
<meta itemprop="seller" itemtype="http://schema.org/Organization" content="Walmart.com"/>

 

It looks like yours just got the ones that start with <meta name="



http://www.000webhost.com/

 

Decent enough host. It goes down more often than I'd like, but it's the best free host that allows PHP that I have found. Create a domain, open the file manager, go to your public_html folder, and create default.php. Paste

<?php
$meta = get_meta_tags("http://www.walmart.com/ip/" . $_GET['id']);
$meta = json_encode($meta);
echo($meta);
?>

 

into it, and save.

 

Use your program to download yoursite.com/?id=[id of the item]

 

Then use a json parser and ta-da.

 

Edit: Saw your edit. I'll see if I can figure out a solution.


<?php

$dom = new DOMDocument();
$content = file_get_contents('http://www.walmart.com/ip/' . $_GET['id']);
@$dom->loadHTML($content);
$a = $dom->getElementsByTagName('meta');
$json = array();

for ($i = 0; $i < $a->length; $i++) {
    if ($a->item($i)->getAttribute('itemprop')) {
        $b = $a->item($i)->getAttribute('content');
        $c = $a->item($i)->getAttribute('itemprop');
        $json[$c] = $b;
    }
}

echo str_replace('\\/', '/', json_encode($json));

?>

 

yoursite.com/?id=10311410 would echo

 

{"image":"http://i.walmartimages.com/i/p/00/01/60/00/27/0001600027528_300X300.jpg","brand":"Cheerios","manufacturer":"General Mills","model":"27528","productID":"016000275287","availability":"http://schema.org/InStock","price":"3.68","priceCurrency":"USD","seller":"Walmart.com","name":"Cheerios Toasted Whole Grain Oat Cereal, 18 oz","dateModified":"2013-01-02","itemReviewed":"Cheerios Toasted Whole Grain Oat Cereal, 18 oz","datePublished":"2011-05-29"}


That's perfect, thank you all for your help. I'm going to work on putting it in a loop to do it for the next 1000 product id's, and see how it goes.



Well that was easy enough :) Like I said earlier, I haven't ever used PHP, but this was easy enough to understand.

 

http://blake.net63.net/?id=10311410

 

<?php

//echo $_GET['id'];
$id = $_GET['id'];
$end = $id + 1000;   // fixed bound; ($id + 1000) recomputed in the condition would never be reached
for ($id; $id < $end; $id++) {
    $dom = new DOMDocument();
    $content = file_get_contents('http://www.walmart.com/ip/' . $id);
    @$dom->loadHTML($content);
    $a = $dom->getElementsByTagName('meta');
    $json = array();   // reset so values don't carry over between items
    for ($i = 0; $i < $a->length; $i++) {
        if ($a->item($i)->getAttribute('itemprop')) {
            $b = $a->item($i)->getAttribute('content');
            $c = $a->item($i)->getAttribute('itemprop');
            $json[$c] = $b;
        }
    }
    echo str_replace('\\/', '/', json_encode($json));
}
?>

 

 

The way it comes back, if you click the link, is multiple JSON objects. Pasting that into the parser doesn't quite work, as it requires a single object. If I edit it and change the closing } and opening { of each object (except the first and last) to a comma, it works, showing everything as one big object. That might be a little confusing, so an example -

Code provides this

 {obj1} {obj2} {obj3} ... {obj 1000}

 

The Json reader requires one object, so if I change it to...

 {obj1 , obj2 , obj3 ... obj 1000} 

it works fine, just displaying them as one object.

 

The PHP just creates them as separate JSON objects, so I don't know if there's an easy way to make the above change automatically, but I don't think I need to worry about that right now. Thanks again for the help!

 

 

 

 

 

EDIT: The hosting service only allows 10 seconds of execution time, so it will only get a few items. Is there a way I can write this locally and run it on a local server, for example in Visual Studio 2012, so that it will get all of the objects, or however many I like?



Why do you need to handle so many items at once?


The idea started with wanting to compare prices between the weekly ads of the company I work for and other companies'. I was testing the feasibility of getting the price of every product on a website. It's more of a proof of concept, really. Thinking about it, you could use a similar script to get values for anything - currency exchange rates, weather information, etc.




Then you only need to handle a certain amount of items at a time. It doesn't make sense to download everything at once.



 

Well, all of the items in an ad will have different ID numbers, and you would have to manually input each one to get the values. If you go through it once and save all 3 million or whatever to a database, you'd have all of them available. This is a very early project, and I don't even know where we're going with it, if anywhere. It's more of a proof of concept that we can do this, and a fun project for me to learn a few new things.

 

 

And thanks, Obfuscator

Edited by Blake


Then change the value in the url and run the loop in your program.

 

Although, I would prefer to download the data as necessary to reduce the startup time.


You would basically have to run it once for every item, though, as they aren't likely to be close in ID numbers. If Walmart has 200 items in their ad and 3 million total, the chance that one item is within 10 IDs of another is very slim. If you were going to do it like that, wouldn't it not even be worth having a loop?

 

The startup time isn't a big deal, I don't think.



Remove the first for loop from your page. Move it to your program.


Another update, in case anyone is still following. I changed the code a bit to return one JSON object with sub-objects for the products I wanted. This works fine on the net63-hosted site I have, and it works with the JSON parser. Here's the final code.

<?php

echo "{";

$id = $_GET['id'];
$end = $id + 5;
$comma = $id;

for ($id; $id < $end; $id++) {

    if ($comma != $id) { echo ","; }

    echo "\"$id\":";
    $dom = new DOMDocument();
    $content = file_get_contents('http://www.walmart.com/ip/' . $id);
    @$dom->loadHTML($content);
    $a = $dom->getElementsByTagName('meta');
    $json = array();   // reset each pass so values don't leak between items

    for ($i = 0; $i < $a->length; $i++) {
        if ($a->item($i)->getAttribute('itemprop')) {
            $b = $a->item($i)->getAttribute('content');
            $c = $a->item($i)->getAttribute('itemprop');
            $json[$c] = $b;
        }
    }

    echo str_replace('\\/', '/', json_encode($json));
}
echo "}";

?>

 

You can see what it returns at blake.net63.net/?id=10311410, which will start at Cheerios and get the next 5 items.

 

I downloaded XAMPP and figured out how to set it up. However, when I run it, I get the same error as before -

Warning: file_get_contents(http://www.walmart.com/ip/10311410): failed to open stream: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. in C:\xampp\htdocs\scripts\walmart.php on line 15

 

Running on localhost versus being hosted on the internet must have some effect on it... not sure how to fix that. Other than that, it works well. Net63 won't let you go over 10 seconds of processing time, though, so I can only get ~15 items.
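That warning is a network-level timeout, so it could just as easily be a local firewall or proxy as anything the site is doing. Still, one thing worth trying: some sites drop requests that arrive without a browser-style User-Agent header, and file_get_contents sends almost nothing by default. A sketch of setting one via a stream context (whether this is the actual cause here is an assumption):

```php
<?php
// Send a browser-like User-Agent; some sites reject requests without one.
$context = stream_context_create(array(
    'http' => array(
        'header'  => "User-Agent: Mozilla/5.0 (Windows NT 6.1)\r\n",
        'timeout' => 30,   // give slow responses more time before failing
    ),
));

$content = file_get_contents('http://www.walmart.com/ip/10311410', false, $context);
var_dump($content !== false);
?>
```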


