MPM Posted December 25, 2010

Hey, with the discussion about all the bots voting on the wilderness poll going on, I thought I would make my own version of a high scores scraper. I'm a person who loves dealing with lots of data, so I decided I would download the 2 million users and their stats from the overall high scores.

Sadly, I realized this wouldn't work after about 3 hours. Because I have to use proxies to get the next users from the high scores, this script runs really slowly. After 2 hours, I only had 3.5k users in my database.

Either way, I thought it would be nice to show you my code.

    #!/usr/bin/perl
    require LWP::UserAgent;
    use Parallel::ForkManager;
    use DBI();

    # Run at most 15 stat downloads in parallel.
    $pm = new Parallel::ForkManager(15);

    # Fetch a page, optionally through a proxy. Returns 0 on failure.
    sub get_page {
        my ($page,$proxy) = @_;
        my $ua = LWP::UserAgent->new;
        $ua->timeout(10);
        $ua->agent('Mozilla/5.0');
        $ua->env_proxy;
        if ($proxy) {
            $ua->proxy(['http', 'ftp'], "http://".$proxy);
        }
        my $response = $ua->get($page);
        if ($response->is_success) {
            return $response->decoded_content; # or whatever
        } else {
            return 0;
        }
    }

    # Download one user's lite high scores and insert them as a single row.
    sub get_user_stats {
        my ($user) = @_;
        my $pagedata = get_page("http://hiscore.runescape.com/index_lite.ws?player=$user");
        my @group = split /\n/,$pagedata;
        my $i = 0;
        my $insert_values = '';
        foreach (@group) {
            if ($i<=25) {
                # Skill lines: rank,level,exp (rank -1 means unranked)
                my ($rank, $level, $exp) = split /,/;
                if ($rank == -1) { $level = 1; $exp = 0; }
                $insert_values .= "$level,$exp,";
            } else {
                # Remaining lines: rank,score
                my ($rank, $score) = split /,/;
                if ($rank == -1) {$score = 0;}
                $insert_values .= "$score,"
            }
            $i++;
        }
        $insert_values =~ s/,$//;
        my $insert = "INSERT INTO highscores VALUES('$user',$insert_values)";
        do_query($insert);
    }

    sub do_query {
        my ($query) = @_;
        $dbh = DBI->connect("DBI:mysql:database=runescape_stats;host=localhost", 'trent', 'password');
        $dbh->do($query);
        $dbh->disconnect();
    }

    do_query("truncate table highscores;");

    #Step 1: Get a list of proxies.
    my @proxies = ();
    for (my $i = 1;$i<=15;$i++){
        my $n = sprintf('%02d',$i);
        $page_data = get_page("http://www.samair.ru/proxy/proxy-$n.htm");
        while ($page_data =~ m/<td>(\d+\.\d+\.\d+\.\d+)<script/g){
            push @proxies, $1;
        }
    }

    # Step 2: Start from the first user on the overall list.
    $pagedata = get_page("http://services.runescape.com/m=hiscore/overall.ws?table=0&category_type=0");
    $pagedata =~ m/<a href="hiscorepersonal\.ws\?user1=(.+?)">\1<\/a>/;
    $user = $1;
    $page = 1;
    get_user_stats($user);

    # Step 3: Reload the list with the last user selected and process everyone after them.
    while ($user) {
        $pagedata = '';
        $attempts = 1;
        while (!$pagedata && @proxies){
            print "Attempt $attempts of page $page\n";
            my $rand_key = int(rand(@proxies));
            $pagedata = get_page("http://services.runescape.com/m=hiscore/overall.ws?table=0&category_type=0&user=$user",$proxies[$rand_key]);
            if (!$pagedata) {
                print "Removing proxy ".$proxies[$rand_key]."\n";
                splice(@proxies, $rand_key, 1);
                $attempts++;
            } else {
                print "Got pagedata for page $page\n";
                # The selected user is highlighted; everything after the highlight is unprocessed.
                my ($used_users,$new_users) = split (/<a style="color:#F3C334;"/,$pagedata);
                if (!$new_users) {
                    print "Removing proxy ".$proxies[$rand_key]." for spamming me...\n";
                    splice(@proxies, $rand_key, 1);
                    $attempts++;
                    $pagedata = ''; # retry this page with another proxy
                } else {
                    $user = '';
                    while ($new_users =~ m/<a href="hiscorepersonal\.ws\?user1=(.+?)">\1<\/a>/g){
                        $user = $1;
                        my $pid = $pm->start and next;
                        print "Getting stats for $user\n";
                        get_user_stats($user);
                        $pm->finish;
                    }
                }
            }
        }
        last unless $pagedata; # ran out of proxies
        $page++;
    }
    $pm->wait_all_children;

~M
Mastermosley Posted December 25, 2010

That's pretty cool. I'm no good at Perl or PHP. How are you getting the next user to grab the stats from? I'm going to try this out in C# :P
MPM (Author) Posted December 25, 2010

Simple regexes. The first thing you'll notice that I do is grab the first user on the list and reload the list with that user selected. That user is highlighted with a different font colour than the rest of the users, effectively splitting the list into two sections: one that I've already processed, and one that I haven't. Then I do a global regex search using /<a href="hiscorepersonal\.ws\?user1=(.+?)">\1<\/a>/ on the list of users I haven't processed and loop through the results.

~M
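A minimal sketch of the split-then-scan idea from the post above, written in Python rather than Perl. The highlight marker and the link pattern are taken from the posted script; the sample HTML and the function name `unprocessed_users` are made up for illustration and only mimic the markup the regex expects.

```python
import re

# Marker for the highlighted (currently selected) user, as used in the Perl script.
HIGHLIGHT = '<a style="color:#F3C334;"'
# Same pattern as in the post: the link text must equal the user1 parameter.
USER_LINK = re.compile(r'<a href="hiscorepersonal\.ws\?user1=(.+?)">\1</a>')

def unprocessed_users(page_html):
    """Return the user names that appear after the highlighted user."""
    before, marker, after = page_html.partition(HIGHLIGHT)
    if not marker:
        return []  # no highlighted row: probably an error or ban page
    return USER_LINK.findall(after)

# Hypothetical sample page, for illustration only.
sample = (
    '<a href="hiscorepersonal.ws?user1=Alice">Alice</a>'
    '<a style="color:#F3C334;" href="hiscorepersonal.ws?user1=Bob">Bob</a>'
    '<a href="hiscorepersonal.ws?user1=Carol">Carol</a>'
    '<a href="hiscorepersonal.ws?user1=Dave">Dave</a>'
)
print(unprocessed_users(sample))  # ['Carol', 'Dave']
```

The last name returned would then become the selected user for the next request, which is how the posted script walks down the list.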
Mastermosley Posted December 25, 2010

Okay, I am just grabbing the users and adding them to a database at this point, but with my code below I can process 1000 users a minute, meaning it's going to take about 33 hours to get all of the users :P And it will probably double when I have to grab each member's stats. I'm stoked to test this out. It runs on a separate thread and gives me the total number of usernames processed and the time elapsed.

    using System;
    using System.Collections.Generic;
    using System.ComponentModel;
    using System.Data;
    using System.Drawing;
    using System.Linq;
    using System.Text;
    using System.Windows.Forms;
    using System.Threading;

    namespace RSGrab
    {
        public partial class Form1 : Form
        {
            int CurrentPageList = 1;   // rank of the first user on the current page
            int usernamec = 0;         // usernames processed so far
            System.Diagnostics.Stopwatch wat = new System.Diagnostics.Stopwatch();
            public delegate void UpdateForm(string text);

            public Form1()
            {
                InitializeComponent();
            }

            private void kryptonButton1_Click(object sender, EventArgs e)
            {
                wat.Start();
                timer1.Start();
                Thread ms = new Thread(MainStuff);
                ms.Start();
            }

            private void MainStuff()
            {
                DateTime eventtime1 = DateTime.Now;
                while (CurrentPageList < 1000000)
                {
                    string aData = getPageSource(getCurrentUrl()); //Grab the source
                    aData = StripTagsCharArray(aData); //Gets rid of <html> tags
                    aData = aData.Trim(); //Trim whitespace at the beginning and end
                    string[] bData = getActualData(aData); //Takes only the data needed
                    //Now that I have the data I need I will parse it for each person
                    int count = 0;
                    int line = 1;
                    while (count < 88)
                    {
                        string name = "";
                        //Account Name - Only grabbing the Username because I will grab the
                        //entire highscores for each member.
                        if (line == 1) { count++; line++; }
                        if (line == 2) { name = bData[count]; count++; line++; }
                        if (line == 3) { count++; line++; }
                        if (line == 4)
                        {
                            ListBox1.Invoke(new UpdateForm(this.AddUserNameXml), new object[] { name });
                            count++;
                            line = 1;
                        }
                    }
                    CurrentPageList = CurrentPageList + 22;
                }
                DateTime eventtime2 = DateTime.Now;
                TimeSpan elapsed = eventtime2 - eventtime1;
                MessageBox.Show(elapsed.ToString());
            }

            private void AddUserNameXml(string username)
            {
                usernamec++;
                label2.Text = usernamec.ToString();
            }

            //Drops blank lines and returns the 88 fields that make up the 22 table rows
            private string[] getActualData(string data)
            {
                string[] aData = data.Split('\r', '\n');
                string bData = "";
                int count = 0;
                while (count < aData.Length)
                {
                    if (aData[count].Trim() != "")
                    {
                        bData += aData[count].ToString() + "%";
                    }
                    count++;
                }
                bData = bData.Trim();
                string[] cData = bData.Split('%');
                int newcount = 122;
                int newcount2 = 0;
                string[] final = new string[89];
                while (newcount <= 209)
                {
                    final[newcount2] = cData[newcount];
                    newcount++;
                    newcount2++;
                }
                return final;
            }

            private string getCurrentUrl()
            {
                string url = "http://services.runescape.com/m=hiscore/overall.ws?rank=" + CurrentPageList.ToString() + "&table=0&scroll=true&category_type=0";
                return url;
            }

            private string getPageSource(string url)
            {
                System.Net.WebClient wb = new System.Net.WebClient();
                string strSource = wb.DownloadString(url);
                wb.Dispose();
                return strSource;
            }

            //Copies every character that is not inside a <...> tag
            private static string StripTagsCharArray(string source)
            {
                char[] array = new char[source.Length];
                int arrayIndex = 0;
                bool inside = false;
                for (int i = 0; i < source.Length; i++)
                {
                    char let = source[i];
                    if (let == '<') { inside = true; continue; }
                    if (let == '>') { inside = false; continue; }
                    if (!inside)
                    {
                        array[arrayIndex] = let;
                        arrayIndex++;
                    }
                }
                return new string(array, 0, arrayIndex);
            }

            private void timer1_Tick(object sender, EventArgs e)
            {
                label4.Text = Math.Round((decimal)wat.Elapsed.TotalMinutes, 2).ToString();
            }
        }
    }
MPM (Author) Posted December 25, 2010

How are you avoiding being banned from accessing the high scores list? Last time I tried to do this, I got banned from accessing the page (not the lite high scores).

~M
Mastermosley Posted December 25, 2010

It's not actually grabbing the high scores yet, just grabbing each user from this: http://services.runescape.com/m=hiscore/overall.ws?rank=1&table=0&scroll=true&category_type=0

Every 22 usernames that are processed, a global variable which starts at 1 goes up by 22 and the link changes accordingly: "http://services.runescape.com/m=hiscore/overall.ws?rank=" + globalvariable + "&table=0&scroll=true&category_type=0", which downloads the next source.

About the banning, I don't know. I haven't been banned yet, anyway. What I am going to do is download the usernames into XML databases, a new one every 100k, which will leave me with 20 of these 100k databases. Then my new program will be multithreaded (20 threads) and will grab the high scores simultaneously, hopefully making this faster. I made a mistake when I said it will double when I access the high scores; it will probably slow down by 22 times :( and the final database should be around 3 GB.
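The paging and batching plan above can be sketched roughly as follows, in Python for brevity. The URL format, the step of 22 and the 100k batch size come from the post; the function names are hypothetical, and the time estimate simply restates the 1000 users per minute figure quoted earlier in the thread.

```python
def page_url(rank):
    """Build the overall-table URL for the page starting at `rank`."""
    return ("http://services.runescape.com/m=hiscore/overall.ws?rank="
            + str(rank) + "&table=0&scroll=true&category_type=0")

def page_ranks(total_users, per_page=22):
    """Starting ranks 1, 23, 45, ... needed to cover `total_users` entries."""
    return range(1, total_users + 1, per_page)

def batches(names, size=100_000):
    """Split a list of names into consecutive chunks of at most `size`."""
    return [names[i:i + size] for i in range(0, len(names), size)]

print(list(page_ranks(100)))                  # [1, 23, 45, 67, 89]
print(page_url(23))
print(len(batches(list(range(250)), 100)))    # 3 chunks: 100 + 100 + 50

# Rough run time at 1000 users per minute for 2 million users.
print(round(2_000_000 / 1000 / 60, 1))        # 33.3 hours
```

With 2 million names and 100k per batch this gives the 20 files mentioned above, one per worker thread.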
MPM (Author) Posted December 25, 2010

Weird that you aren't getting banned. I just did a test, and I was only able to get about 20k users before they banned me and suggested that I use the lite high scores.

~M
Mastermosley Posted December 25, 2010

Couldn't you continue from where you left off? Give it 10 minutes, then restart at the point you left off?
Mastermosley Posted December 25, 2010

Yeah, my program gets kicked off at 21450 users. I'll just wait a while and restart from there.
MPM (Author) Posted December 26, 2010

Yeah, I realized that I could just make it sleep for a while if it doesn't find any users. But I'm on another project now, so that'll happen some other time.

~M
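A rough sketch of the "sleep when nothing comes back, then carry on from the same spot" idea, in Python. Everything here is an assumption for illustration: `fetch_page` stands for whatever downloads and parses one page, the 600 second wait mirrors the "10 minutes" suggested above, and `fake_fetch` is a stand-in that refuses one request so the retry path gets exercised.

```python
import time

def crawl(fetch_page, start_rank, last_rank, per_page=22,
          wait_seconds=600, max_waits=3, sleep=time.sleep):
    """Collect names page by page; on an empty page, wait and retry the same rank.

    fetch_page(rank) must return a list of names, or an empty list when the
    site refuses the request. Returns (names, next_rank) so that a later run
    can resume from next_rank.
    """
    names = []
    rank = start_rank
    waits = 0
    while rank <= last_rank:
        page = fetch_page(rank)
        if not page:
            if waits == max_waits:
                break            # give up for now; resume later from `rank`
            waits += 1
            sleep(wait_seconds)  # back off before retrying the same rank
            continue
        waits = 0
        names.extend(page)
        rank += per_page
    return names, rank

# Hypothetical stand-in for a real downloader: refuses rank 23 once.
refused = {23}
def fake_fetch(rank):
    if rank in refused:
        refused.discard(rank)
        return []
    return ["user%d" % r for r in range(rank, rank + 22)]

names, next_rank = crawl(fake_fetch, 1, 66, sleep=lambda seconds: None)
print(len(names), next_rank)  # 66 67
```

Saving `next_rank` to disk between runs would let a restarted program pick up at, say, rank 21450 instead of starting over.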
Mastermosley Posted December 26, 2010

Awesome, thanks for giving me something to do anyway.
sees_all1 Posted January 1, 2011

You could try going through the users in each of the skill lists. I'd bet you'd get closer to 3-4 million unique users than the top 2 million skill totals alone will provide. Of course you'd have to merge your lists, which may or may not be a problem.
Ember Posted January 3, 2011

.
MPM (Author) Posted January 6, 2011

Quote: "Of course you'd have to merge your lists, which may or may not be a problem."

Quote: If you work it down to a text list of names for each skill, GNU sort's -u flag removes duplicate values and -o lets you specify an output file.

    sort -u [list1] [list2] [list3]... -o mergedlist

I don't foresee any problem with mpm's database, just add all the users and:

    SELECT DISTINCT username FROM highscores;

(or some such)

Quote: $dbh = DBI->connect("DBI:mysql:database=runescape_stats;host=localhost", 'trent', 'password');

Quote: "I'm on another project"

Quote: Is it a password generator, by chance? :D

I've made a million password generators :P

Edit: Ohhh, I just got what you meant. That's not my real password :P

~M
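For anyone without GNU sort to hand, the same merge-and-deduplicate step can be sketched in Python. This is only an illustration: the function name and the sample names are invented, and the ordering is plain code-point order, which may differ from what `sort -u` produces under a given locale.

```python
def merge_unique(*name_lists):
    """Merge several name lists into one sorted list without duplicates."""
    merged = set()
    for names in name_lists:
        # Ignore blank lines and surrounding whitespace, as a text list would have.
        merged.update(name.strip() for name in names if name.strip())
    return sorted(merged)

# Hypothetical per-skill lists, for illustration only.
attack = ["Dave", "Alice", "Bob"]
fishing = ["Bob", "Carol", "Alice", ""]
print(merge_unique(attack, fishing))  # ['Alice', 'Bob', 'Carol', 'Dave']
```

A set keeps memory use proportional to the number of unique names, which should be manageable for a few million short strings.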