December 25, 2010
Hey, with the discussion about all the bots voting on the wilderness poll going on, I thought I would make my own version of a high scores scraper. I'm a person who loves dealing with lots of data, so I decided I would download the 2 million users and their stats from the overall high scores.
Sadly, I realized this wouldn't work after about 3 hours. Because I have to use proxies to get the next users from the high scores, this script runs really slowly. After 2 hours, I only had 3.5k users in my database.
Either way, I thought it would be nice to show you my code.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use Parallel::ForkManager;
    use DBI;

    # Run up to 15 stat lookups in parallel.
    my $pm = Parallel::ForkManager->new(15);

    # Fetch a URL, optionally through a proxy.
    # Returns the page body, or 0 on failure.
    sub get_page {
        my ($page, $proxy) = @_;
        my $ua = LWP::UserAgent->new;
        $ua->timeout(10);
        $ua->agent('Mozilla/5.0');
        $ua->env_proxy;
        if ($proxy) {
            $ua->proxy(['http', 'ftp'], "http://" . $proxy);
        }
        my $response = $ua->get($page);
        return $response->is_success ? $response->decoded_content : 0;
    }

    # Download one user's lite high scores and insert them as a single row.
    sub get_user_stats {
        my ($user) = @_;
        my $pagedata = get_page("http://hiscore.runescape.com/index_lite.ws?player=$user");
        return unless $pagedata;    # skip users whose stats could not be fetched
        my @values;
        my $i = 0;
        foreach (split /\n/, $pagedata) {
            if ($i <= 25) {
                # Skill line: rank,level,exp (rank -1 means unranked).
                my ($rank, $level, $exp) = split /,/;
                if ($rank == -1) { $level = 1; $exp = 0; }
                push @values, $level, $exp;
            } else {
                # Remaining lines: rank,score.
                my ($rank, $score) = split /,/;
                if ($rank == -1) { $score = 0; }
                push @values, $score;
            }
            $i++;
        }
        my $insert_values = join ',', @values;
        do_query("INSERT INTO highscores VALUES('$user',$insert_values)");
    }

    # Each call opens its own connection, so forked children never share a handle.
    sub do_query {
        my ($query) = @_;
        my $dbh = DBI->connect("DBI:mysql:database=runescape_stats;host=localhost", 'trent', 'password')
            or die "Cannot connect: $DBI::errstr";
        $dbh->do($query);
        $dbh->disconnect();
    }

    do_query("truncate table highscores;");

    # Step 1: Get a list of proxies.
    my @proxies = ();
    for my $i (1 .. 15) {
        my $n = sprintf('%02d', $i);
        my $page_data = get_page("http://www.samair.ru/proxy/proxy-$n.htm");
        while ($page_data =~ m/<td>(\d+\.\d+\.\d+\.\d+)<script/g) {
            push @proxies, $1;
        }
    }

    # Step 2: Grab the first user on the overall list.
    my $first_page = get_page("http://services.runescape.com/m=hiscore/overall.ws?table=0&category_type=0");
    my ($user) = $first_page =~ m/<a href="hiscorepersonal\.ws\?user1=(.+?)">\1<\/a>/;
    my $page = 1;
    get_user_stats($user) if $user;

    # Step 3: Reload the list with the last known user selected and
    # process everyone listed after the highlighted entry.
    while ($user) {
        my $pagedata = '';
        my $attempts = 1;
        while (!$pagedata && @proxies) {
            print "Attempt $attempts of page $page\n";
            my $rand_key = int(rand(@proxies));    # rand($#proxies) never picked the last proxy
            $pagedata = get_page("http://services.runescape.com/m=hiscore/overall.ws?table=0&category_type=0&user=$user", $proxies[$rand_key]);
            if (!$pagedata) {
                print "Removing proxy $proxies[$rand_key]\n";
                splice(@proxies, $rand_key, 1);    # was @array, which removed nothing
                $attempts++;
                next;
            }
            print "Got pagedata for page $page\n";
            my ($used_users, $new_users) = split(/<a style="color:#F3C334;"/, $pagedata);
            if (!$new_users) {
                print "Removing proxy $proxies[$rand_key] for spamming me...\n";
                splice(@proxies, $rand_key, 1);
                $pagedata = '';                    # retry this page with another proxy
                $attempts++;
                next;
            }
            $user = '';
            while ($new_users =~ m/<a href="hiscorepersonal\.ws\?user1=(.+?)">\1<\/a>/g) {
                $user = $1;
                $pm->start and next;               # parent moves on to the next user
                print "Getting stats for $user\n";
                get_user_stats($user);
                $pm->finish;
            }
        }
        last unless @proxies;                      # out of proxies: stop instead of looping forever
        $page++;
    }
    $pm->wait_all_children;

~M
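For anyone who doesn't read Perl, here is a rough Python sketch of what get_user_stats does with a lite high scores response. It assumes the same layout the script relies on (the first 26 lines are rank,level,exp, later lines are rank,score, and a rank of -1 means unranked); that layout comes from the script, not from any official documentation. The function name parse_lite_stats and the sample text are made up for illustration.

```python
def parse_lite_stats(text, skill_lines=26):
    """Flatten a lite high scores response into a list of column values.

    Assumed layout (taken from the Perl script above): the first
    `skill_lines` lines are "rank,level,exp" and the rest are "rank,score".
    A rank of -1 means the player is unranked for that entry.
    """
    values = []
    for i, line in enumerate(text.strip().splitlines()):
        fields = [int(f) for f in line.split(",")]
        rank = fields[0]
        if i < skill_lines:
            level, exp = fields[1], fields[2]
            if rank == -1:
                level, exp = 1, 0      # same defaults the script stores
            values += [level, exp]
        else:
            score = fields[1]
            if rank == -1:
                score = 0
            values.append(score)
    return values


# Made-up sample: two skill lines followed by one score line.
sample = "5,99,200000000\n-1,-1,-1\n-1,-1"
print(parse_lite_stats(sample, skill_lines=2))  # [99, 200000000, 1, 0, 0]
```

The returned list lines up with the columns that the INSERT statement fills after the username.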
December 25, 2010
That's pretty cool. I'm no good at Perl or PHP. How are you getting the next user to grab the stats from? I'm going to try this out in C# :P
December 25, 2010 Author
Quote: "That's pretty cool. I'm no good at Perl or PHP. How are you getting the next user to grab the stats from? I'm going to try this out in C# :P"
Simple regexes. The first thing you'll notice is that I grab the first user on the list and reload the list with that user selected. That user is highlighted in a different font colour from the rest of the users, effectively splitting the list into two sections: one that I've already processed, and one that I haven't. Then I do a global regex search using /<a href="hiscorepersonal\.ws\?user1=(.+?)">\1<\/a>/ on the list of users I haven't processed and loop through the results.
~M
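The split-then-match idea described above can be sketched in Python like this. The highlight marker and the link pattern are copied from the Perl script; the function name unprocessed_users and the sample HTML are invented for illustration and are only a guess at what the real page looked like.

```python
import re

# Marker and link pattern copied from the Perl script in the first post.
HIGHLIGHT = '<a style="color:#F3C334;"'
USER_LINK = re.compile(r'<a href="hiscorepersonal\.ws\?user1=(.+?)">\1</a>')


def unprocessed_users(html):
    """Return the usernames listed after the highlighted (selected) user."""
    before, marker, after = html.partition(HIGHLIGHT)
    if not marker:
        return []                      # no highlighted user on this page
    # The \1 backreference only accepts links whose text repeats the
    # username from the URL, which filters out unrelated links.
    return USER_LINK.findall(after)


# Invented sample page: Bob is the selected (highlighted) user.
sample = (
    '<a href="hiscorepersonal.ws?user1=Alice">Alice</a>'
    '<a style="color:#F3C334;" href="hiscorepersonal.ws?user1=Bob">Bob</a>'
    '<a href="hiscorepersonal.ws?user1=Carol">Carol</a>'
    '<a href="hiscorepersonal.ws?user1=Dave">Dave</a>'
)
print(unprocessed_users(sample))  # ['Carol', 'Dave']
```

The last name returned becomes the next selected user, which is how the loop walks down the list one page at a time.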
December 25, 2010
Okay, I am just grabbing the users and adding them to a database at this point, but with my code below I can process 1000 users a minute, meaning it's going to take 33 hours to get all of the users :P And it will probably double when I have to grab each member's stats. I'm stoked to test this out. It runs on a separate thread and gives me the total number of usernames processed and the time elapsed.

    using System;
    using System.Collections.Generic;
    using System.ComponentModel;
    using System.Data;
    using System.Drawing;
    using System.Linq;
    using System.Text;
    using System.Windows.Forms;
    using System.Threading;

    namespace RSGrab
    {
        public partial class Form1 : Form
        {
            int CurrentPageList = 1;   // rank of the first user on the current page
            int usernamec = 0;         // usernames processed so far
            System.Diagnostics.Stopwatch wat = new System.Diagnostics.Stopwatch();
            public delegate void UpdateForm(string text);

            public Form1()
            {
                InitializeComponent();
            }

            private void kryptonButton1_Click(object sender, EventArgs e)
            {
                wat.Start();
                timer1.Start();
                Thread ms = new Thread(MainStuff);
                ms.Start();
            }

            private void MainStuff()
            {
                DateTime eventtime1 = DateTime.Now;
                while (CurrentPageList < 1000000)
                {
                    string aData = getPageSource(getCurrentUrl()); // Grab the source
                    aData = StripTagsCharArray(aData);             // Get rid of the HTML tags
                    aData = aData.Trim();                          // Trim() returns a new string, so keep the result
                    string[] bData = getActualData(aData);         // Take only the data needed

                    // Each user occupies four entries and the name is the second one.
                    // Only the username is grabbed here because I will grab the
                    // entire highscores for each member later.
                    for (int count = 0; count < 88; count += 4)
                    {
                        string name = bData[count + 1];
                        ListBox1.Invoke(new UpdateForm(this.AddUserNameXml), new object[] { name });
                    }
                    CurrentPageList = CurrentPageList + 22;
                }
                DateTime eventtime2 = DateTime.Now;
                TimeSpan elapsed = eventtime2 - eventtime1; // end minus start; the reverse gave a negative time
                MessageBox.Show(elapsed.ToString());
            }

            private void AddUserNameXml(string username)
            {
                usernamec++;
                label2.Text = usernamec.ToString();
            }

            private string[] getActualData(string data)
            {
                // Keep only the non-blank lines.
                List<string> lines = new List<string>();
                foreach (string l in data.Split('\r', '\n'))
                {
                    if (l.Trim() != "")
                    {
                        lines.Add(l);
                    }
                }
                // Entries 122 to 209 hold the 22 users on the page (4 entries each).
                // These offsets depend on the page layout.
                string[] final = new string[88];
                lines.CopyTo(122, final, 0, 88);
                return final;
            }

            private string getCurrentUrl()
            {
                string url = "http://services.runescape.com/m=hiscore/overall.ws?rank=" + CurrentPageList.ToString() + "&table=0&scroll=true&category_type=0";
                return url;
            }

            private string getPageSource(string url)
            {
                using (System.Net.WebClient wb = new System.Net.WebClient())
                {
                    return wb.DownloadString(url);
                }
            }

            // Copy every character that is not inside a <...> tag.
            private static string StripTagsCharArray(string source)
            {
                char[] array = new char[source.Length];
                int arrayIndex = 0;
                bool inside = false;
                for (int i = 0; i < source.Length; i++)
                {
                    char let = source[i];
                    if (let == '<') { inside = true; continue; }
                    if (let == '>') { inside = false; continue; }
                    if (!inside)
                    {
                        array[arrayIndex] = let;
                        arrayIndex++;
                    }
                }
                return new string(array, 0, arrayIndex);
            }

            private void timer1_Tick(object sender, EventArgs e)
            {
                label4.Text = Math.Round((decimal)wat.Elapsed.TotalMinutes, 2).ToString();
            }
        }
    }
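The 33-hour figure in this post follows from the numbers in the thread: about 2 million users at 1000 users a minute, with the rank counter moving 22 users per page. A quick Python check of that arithmetic (the function names are made up, and the totals are the round figures quoted in the thread, not measured values):

```python
import math


def hours_needed(total_users, users_per_minute):
    """Hours needed to list every user at a steady rate."""
    return total_users / users_per_minute / 60


def pages_needed(total_users, users_per_page=22):
    """Number of list pages to download, rounding a partial last page up."""
    return math.ceil(total_users / users_per_page)


print(round(hours_needed(2_000_000, 1000), 1))  # 33.3
print(pages_needed(2_000_000))                  # 90910
```

So the estimate holds only if the site keeps serving roughly 45 pages a minute for the whole run.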
December 25, 2010 Author
Quote: "Okay, I am just grabbing the users and adding them to a database at this point, but with my code below I can process 1000 users a minute, meaning it's going to take 33 hours to get all of the users :P And it will probably double when I have to grab each member's stats. I'm stoked to test this out. It runs on a separate thread and gives me the total number of usernames processed and the time elapsed."
How are you avoiding being banned from accessing the high scores list? Last time I tried to do this, I got banned from accessing the page (not the lite high scores).
~M
December 25, 2010
It's not actually grabbing the highscores yet, just grabbing each user from this: http://services.runescape.com/m=hiscore/overall.ws?rank=1&table=0&scroll=true&category_type=0
For every 22 usernames processed, a global variable that starts at 1 goes up by 22 and the link changes accordingly: "http://services.runescape.com/m=hiscore/overall.ws?rank=" + globalvariable + "&table=0&scroll=true&category_type=0", which downloads the next source.
About the banning, I don't know; I haven't been banned yet anyway. What I am going to do is download each username into an XML database, starting a new one every 100k, which will leave me with 20 of these 100k databases. Then my new program will be multithreaded (20 threads) and will grab the highscores simultaneously, hopefully making this faster. I made a mistake when I said it will double when I access the highscores; it will probably slow down by 22 times :( and the final database should be around 3 GB.
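The paging and the 100k split described in this post can be sketched in Python as follows. The URL is the one quoted in the post; the function names (page_url, page_ranks, chunk_index) are hypothetical, and the mapping of ranks to files is only one plausible reading of "a new one every 100k".

```python
def page_url(rank):
    """URL of the list page that starts at the given rank."""
    return ("http://services.runescape.com/m=hiscore/overall.ws?rank="
            + str(rank) + "&table=0&scroll=true&category_type=0")


def page_ranks(total_users, page_size=22):
    """Starting rank of every page: 1, 23, 45, ..."""
    return range(1, total_users + 1, page_size)


def chunk_index(rank, chunk_size=100_000):
    """Which 100k file (0-based) a user at this rank would be stored in."""
    return (rank - 1) // chunk_size


print(list(page_ranks(50)))        # [1, 23, 45]
print(chunk_index(100_000))        # 0
print(chunk_index(100_001))        # 1
print(chunk_index(2_000_000) + 1)  # 20 files for 2 million users
```

With one worker thread per file, each thread would only ever touch its own chunk, so the 20 threads would not need to coordinate writes.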
December 25, 2010 Author
Quote: "It's not actually grabbing the highscores yet, just grabbing each user from this: http://services.runescape.com/m=hiscore/overall.ws?rank=1&table=0&scroll=true&category_type=0 For every 22 usernames processed, a global variable that starts at 1 goes up by 22 and the link changes accordingly: "http://services.runescape.com/m=hiscore/overall.ws?rank=" + globalvariable + "&table=0&scroll=true&category_type=0", which downloads the next source. About the banning, I don't know; I haven't been banned yet anyway. What I am going to do is download each username into an XML database, starting a new one every 100k, which will leave me with 20 of these 100k databases. Then my new program will be multithreaded (20 threads) and will grab the highscores simultaneously, hopefully making this faster. I made a mistake when I said it will double when I access the highscores; it will probably slow down by 22 times :( and the final database should be around 3 GB."
Weird that you aren't getting banned. I just did a test, and I was only able to get about 20k users before they banned me and suggested that I use the lite high scores.
~M
December 25, 2010
Couldn't you continue from where you left off? Give it 10 minutes, then restart at the point where you stopped?
December 25, 2010
Yeah, my program gets kicked off at 21,450 users. I'll just wait a while and restart from there.
December 26, 2010 Author
Quote: "Couldn't you continue from where you left off? Give it 10 minutes, then restart at the point where you stopped?"
Yeah, I realized that I could just make it sleep for a while if it doesn't find any users. But I'm on another project now, so that'll happen some other time.
~M
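The sleep-and-resume idea from the last few posts could look roughly like this in Python. Everything here is a sketch under assumptions: fetch_page stands for whatever function downloads one page and returns its usernames (an empty list meaning the site has stopped answering), and crawl_with_pauses, max_pauses and the injected sleep are hypothetical names. The 10-minute pause is the one suggested above.

```python
import time


def crawl_with_pauses(fetch_page, start_rank, last_rank, page_size=22,
                      pause_seconds=600, max_pauses=3, sleep=time.sleep):
    """Walk the list from start_rank, pausing and retrying on empty pages.

    Returns (names, next_rank) so a later run can resume from next_rank.
    """
    names = []
    rank = start_rank
    pauses = 0
    while rank <= last_rank:
        users = fetch_page(rank)
        if not users:
            if pauses >= max_pauses:
                break                  # give up; the caller can resume later
            pauses += 1
            sleep(pause_seconds)       # wait, then retry the same page
            continue
        pauses = 0
        names.extend(users)
        rank += page_size
    return names, rank


# Demo with a fake fetcher that fails once, so nothing touches the network.
attempts = []

def fake_fetch(rank):
    attempts.append(rank)
    if rank == 23 and attempts.count(23) == 1:
        return []                      # simulate one temporary block
    return ["user%d" % rank]

slept = []
print(crawl_with_pauses(fake_fetch, 1, 45, sleep=slept.append))
print(slept)
```

Returning the next rank is the important part: saving that number somewhere is what lets a restart pick up at 21,450 instead of at 1.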
January 1, 2011
You could try going with users in each of the skill lists. I'd bet you'd get closer to 3-4 million unique users than just the top 2 million skill totals will provide. Of course you'd have to merge your lists, which may or may not be a problem.

99 dungeoneering achieved, thanks to everyone that celebrated with me!
♪♪ Don't interrupt me as I struggle to complete this thought
Have some respect for someone more forgetful than yourself ♪♪
♪♪ And I'm not done
And I won't be till my head falls off ♪♪
January 6, 2011 Author
Quote:
    Quote: "Of course you'd have to merge your lists, which may or may not be a problem."
    If you work it down to a text list of names for each skill, GNU sort's -u flag removes duplicate values and -o lets you specify an output file.
        sort -u [list1] [list2] [list3]... -o mergedlist
    I don't foresee any problem with mpm's database, just add all the users and:
        SELECT DISTINCT username FROM highscores;
    (or some such)
    Quote: "$dbh = DBI->connect("DBI:mysql:database=runescape_stats;host=localhost", 'trent', 'password');" "I'm on another project"
    Is it a password generator, by chance? :D
I've made a million password generators :P
Edit: Ohhh, I just got what you meant. That's not my real password :P
~M
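On the list-merging suggestion quoted above: if the per-skill name lists are already in memory rather than in text files, the same deduplication that sort -u performs can be sketched in Python with a set union. The function name merge_unique and the sample names are made up for illustration.

```python
def merge_unique(*name_lists):
    """Merge any number of name lists into one sorted list without duplicates."""
    return sorted(set().union(*name_lists))


attack = ["Carol", "Alice"]
fishing = ["Alice", "Bob"]
print(merge_unique(attack, fishing))  # ['Alice', 'Bob', 'Carol']
```

Python sorts by code point here, so the order can differ from what GNU sort produces under a locale-aware collation; the set of names is the same either way.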