Script I made: Scrape the entire High scores


MPM

Hey, with the discussion about all the bots voting on the wilderness poll going on, I thought I would make my own version of a high scores scraper.

I'm a person who loves dealing with lots of data, so I decided I would download the 2 million users and their stats from the overall high scores.

Sadly, I realized this wouldn't work after about 3 hours. Because I'm having to use proxies to get the next users from the high scores, this script runs really slowly. After 2 hours, I only had 3.5k users in my database.

Either way, I thought it would be nice to show you my code.

#!/usr/bin/perl
require LWP::UserAgent;
use Parallel::ForkManager;
use DBI();
$pm = new Parallel::ForkManager(15);

sub get_page {
my ($page,$proxy) = @_;
my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->agent('Mozilla/5.0');
$ua->env_proxy;
if ($proxy) {
    $ua->proxy(['http', 'ftp'], "http://".$proxy);
}
my $response = $ua->get($page);

if ($response->is_success) {
    return $response->decoded_content;  # or whatever
}
else {
   return 0;
}
}
sub get_user_stats {
   my ($user) = @_;
   my $pagedata = get_page("http://hiscore.runescape.com/index_lite.ws?player=$user");
   return unless $pagedata;  # get_page returns 0 on failure, so skip this user
   my @group = split /\n/,$pagedata;
   my $i = 0;
   my $insert_values = '';
   foreach (@group) {
       if ($i<=25) {
           my ($rank, $level, $exp) = split /,/;
           if ($rank == -1) {
               $level = 1;
               $exp = 0;
           }
           $insert_values .= "$level,$exp,";
       }
       else {
           my ($rank, $score) = split /,/;
           if ($rank == -1) {$score = 0;}
           $insert_values .= "$score,"
       }
       $i++;
   }
   $insert_values =~ s/,$//;
   my $insert = "INSERT INTO highscores VALUES('$user',$insert_values)";
   do_query($insert);

}
sub do_query {
   my ($query) = @_;
   $dbh = DBI->connect("DBI:mysql:database=runescape_stats;host=localhost", 'trent', 'password');
   $dbh->do($query);
   $dbh->disconnect();

}
do_query("truncate table highscores;");
#Step 1: Get a list of proxies.
my @proxies = ();
for (my $i = 1;$i<=15;$i++){
   $i = sprintf('%02s',$i);
   $page_data = get_page("http://www.samair.ru/proxy/proxy-$i.htm");
   while ($page_data =~ m/<td>(\d+\.\d+\.\d+\.\d+)<script/g){
       push @proxies, $1;
   }
}
$pagedata = get_page("http://services.runescape.com/m=hiscore/overall.ws?table=0&category_type=0");
$pagedata =~ m/<a href="hiscorepersonal\.ws\?user1=(.+?)">\1<\/a>/;
$user = $1;
$page = 1;
get_user_stats($user);
while ($user) {
   $pagedata = '';
   $attempts = 1;
   while (!$pagedata && @proxies){  # @proxies is false only when the list is empty
       print "Attempt $attempts of page $page\n";
       my $rand_key = int(rand(@proxies));  # rand(@proxies) can pick any index, including the last
       $pagedata = get_page("http://services.runescape.com/m=hiscore/overall.ws?table=0&category_type=0&user=$user",$proxies[$rand_key]);
       if (!$pagedata) {
           print "Removing proxy ".$proxies[$rand_key]."\n";
           splice(@proxies, $rand_key, 1);
           $attempts++;
       }
       else {
           print "Got pagedata for page $page\n";
           my ($used_users,$new_users) = split (/<a style="color:#F3C334;"/,$pagedata);
           if (!$new_users) {
               print "Removing proxy ".$proxies[$rand_key]." for spamming me...\n";
               splice(@proxies, $rand_key, 1);
               $attempts++;
           }
           else {
               $user = '';
               while ($new_users =~ m/<a href="hiscorepersonal\.ws\?user1=(.+?)">\1<\/a>/g){
                  $user = $1;
                  my $pid = $pm->start and next;
                  print "Getting stats for $user\n";
                  get_user_stats($user);
                  $pm->finish;
               }
           }
       }
   }
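   last unless @proxies;  # stop once every proxy has been discarded
   $page++;
}
$pm->wait_all_children;  # let the forked stat fetchers finish before exiting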



~M
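
For anyone following along in another language, here is a rough Python sketch of the parsing step that get_user_stats performs above. It assumes, as the Perl does, that the first 26 lines of the lite output are "rank,level,exp" and the remaining lines are "rank,score", with -1 meaning unranked. The function names and the "?" placeholder style (sqlite3-style; MySQL drivers typically use %s) are illustrative assumptions, not part of the original script.

```python
def parse_lite_stats(text, skill_rows=26):
    """Flatten lite hiscore output into a list of column values.

    Assumes the first `skill_rows` lines are rank,level,exp and the
    rest are rank,score, mirroring the Perl script's $i <= 25 check.
    """
    values = []
    for i, line in enumerate(text.strip().splitlines()):
        fields = [int(f) for f in line.split(",")]
        if i < skill_rows:
            rank, level, exp = fields
            if rank == -1:          # unranked skill: store level 1, 0 exp
                level, exp = 1, 0
            values += [level, exp]
        else:
            rank, score = fields
            values.append(0 if rank == -1 else score)
    return values


def build_insert(user, values):
    """Return (sql, params) using placeholders instead of string pasting."""
    placeholders = ", ".join(["?"] * (len(values) + 1))
    return f"INSERT INTO highscores VALUES ({placeholders})", [user, *values]
```

Passing the username as a parameter rather than pasting it into the SQL string avoids trouble with names that contain quotes.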


That's pretty cool. I'm no good at Perl or PHP. How are you getting the next user to grab the stats from? I'm going to try this out in C# :P

Simple regexes. The first thing you'll notice is that I grab the first user on the list and reload the list with that user selected. That user is highlighted with a different font colour than the rest of the users, effectively splitting the list into two sections: one that I've already processed and one that I haven't.

Then I do a global regex search using /<a href="hiscorepersonal\.ws\?user1=(.+?)">\1<\/a>/ on the list of users I haven't processed and loop through the results.

~M
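
The split-then-match idea described above could look roughly like this in Python. The highlight marker and the link pattern are copied from the Perl script; the function name and the sample HTML in the usage note are made up for illustration, and the real page markup may differ.

```python
import re

# Marker and link pattern taken from the Perl script in the first post.
HIGHLIGHT = '<a style="color:#F3C334;"'
USER_LINK = re.compile(r'<a href="hiscorepersonal\.ws\?user1=(.+?)">\1</a>')


def unprocessed_users(html):
    """Return the usernames listed after the highlighted (selected) user."""
    _before, marker, after = html.partition(HIGHLIGHT)
    if not marker:
        # No highlighted user found: the page is probably an error or block page.
        return []
    return USER_LINK.findall(after)
```

For a page where Bob is the highlighted user, followed by ordinary links for Carol and Dave, this returns Carol and Dave. An empty list is the same signal the Perl script uses to drop a proxy.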


Okay, I am just grabbing the users and adding them to a database at this point, but with my code below I can process 1000 users a minute, meaning it's going to take 33 hours to get all of the users :P And it will probably double when I have to grab each member's stats. I'm stoked to test this out.

 

It runs on a separate thread and gives me the total number of usernames processed and the time elapsed.

 

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Threading;

namespace RSGrab
{
   public partial class Form1 : Form
   {
       int CurrentPageList = 1;
       int usernamec = 0;
       System.Diagnostics.Stopwatch wat = new System.Diagnostics.Stopwatch();
       public delegate void UpdateForm(string text);
       public Form1()
       {
           InitializeComponent();
       }

       private void kryptonButton1_Click(object sender, EventArgs e)
       {

           wat.Start();
           timer1.Start();
           Thread ms = new Thread(MainStuff);
           ms.Start();


       }
       private void MainStuff()
       {
           DateTime eventtime1 = DateTime.Now;
           while (CurrentPageList < 1000000)
           {
               string aData = getPageSource(getCurrentUrl()); //Grab the source
               aData = StripTagsCharArray(aData); //Gets Rid of <html> tags
               aData = aData.Trim(); //Trim() returns a new string, so keep the result
               string[] bData = getActualData(aData); //Takes only the data needed
               //Now that I have the data I need I will parse it for each person
               int count = 0;
               int line = 1;
               while (count < 88)
               {
                   string name = ""; //Account Name - Only grabbing the Username because I will grab the
                   //entire highscores for each member.
                   if (line == 1)
                   {
                       count++;
                       line++;
                   }
                   if (line == 2)
                   {
                       name = bData[count];
                       count++;
                       line++;
                   }
                   if (line == 3)
                   {
                       count++;
                       line++;
                   }
                   if (line == 4)
                   {
                       ListBox1.Invoke(new UpdateForm(this.AddUserNameXml), new object[] { name });
                       count++;
                       line = 1;
                   }
               }

               CurrentPageList = CurrentPageList + 22;
           }
           DateTime eventtime2 = DateTime.Now;
           TimeSpan elapsed = eventtime2 - eventtime1; //end minus start, so the span is positive
           MessageBox.Show(elapsed.ToString());
       }

       private void AddUserNameXml(string username)
       {
           usernamec++;
           label2.Text = usernamec.ToString();
       }
       private string[] getActualData(string data)
       {
           string[] aData = data.Split('\r', '\n');
           string bData = "";
           int count = 0;
           while (count < aData.Length)
           {
               if (aData[count].Trim() != "")
               {
                   bData += aData[count].ToString() + "%";
               }
               count++;
           }
           bData = bData.Trim(); //strings are immutable, so assign the trimmed result
           string[] cData = bData.Split('%');
           int newcount = 122;
           int newcount2 = 0;
           string[] final = new string[89];
           while (newcount <= 209)
           {
               final[newcount2] = cData[newcount];
               newcount++;
               newcount2++;
           }
           return final;

       }
       private string getCurrentUrl()
       {
           string url = "http://services.runescape.com/m=hiscore/overall.ws?rank=" + CurrentPageList.ToString() + "&table=0&scroll=true&category_type=0";
           return url;
       }
       private string getPageSource(string url)
       {
           System.Net.WebClient wb = new System.Net.WebClient();
           string strSource = wb.DownloadString(url);
           wb.Dispose();
           return strSource;
       }
       private static string StripTagsCharArray(string source)
       {
           char[] array = new char[source.Length];
           int arrayIndex = 0;
           bool inside = false;

           for (int i = 0; i < source.Length; i++)
           {
               char let = source[i];
               if (let == '<')
               {
                   inside = true;
                   continue;
               }
               if (let == '>')
               {
                   inside = false;
                   continue;
               }
               if (!inside)
               {
                   array[arrayIndex] = let;
                   arrayIndex++;
               }
           }
           return new string(array, 0, arrayIndex);
       }

       private void timer1_Tick(object sender, EventArgs e)
       {
           label4.Text = Math.Round((decimal)wat.Elapsed.TotalMinutes, 2).ToString();
       }


   }
}


How are you avoiding being banned from accessing the high scores list? Last time I tried to do this, I got banned from accessing the page (not the lite high scores).

~M


It's not actually grabbing the highscores yet, just grabbing each user from this page:

http://services.runescape.com/m=hiscore/overall.ws?rank=1&table=0&scroll=true&category_type=0

Every time 22 usernames are processed, a global variable that starts at 1 goes up by 22 and the link changes accordingly: "http://services.runescape.com/m=hiscore/overall.ws?rank=" + globalvariable + "&table=0&scroll=true&category_type=0", which downloads the next page's source. About the banning, I don't know; I haven't been banned yet, anyway. What I am going to do is download the usernames into XML databases, starting a new one every 100k, which will leave me with 20 of these 100k databases. Then my new program will be multithreaded (20 threads) and will grab the highscores simultaneously, hopefully making this faster.

 

I made a mistake when I said it will double when I access the highscores. It will probably slow down by about 22 times :( and the final database should be around 3 GB.
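
A small Python sketch of the paging and batching plan described here, for comparison with the C# above. The URL pattern and the step of 22 come from the post; the function names and the 2,000,000 total used in the usage note are assumptions for illustration.

```python
from itertools import islice

PAGE_SIZE = 22  # usernames per overall.ws page, per the post above
BASE = "http://services.runescape.com/m=hiscore/overall.ws"


def page_urls(total_users, page_size=PAGE_SIZE):
    """Yield one list-page URL per block of `page_size` ranks, starting at rank 1."""
    for rank in range(1, total_users + 1, page_size):
        yield f"{BASE}?rank={rank}&table=0&scroll=true&category_type=0"


def chunked(names, size=100_000):
    """Split an iterable of usernames into lists of at most `size` names."""
    it = iter(names)
    while chunk := list(islice(it, size)):
        yield chunk
```

With 2,000,000 users this gives 90,910 list pages and 20 batches of 100k names. Fetching stats needs one request per user instead of one per 22 users, which is where the roughly 22 times slowdown comes from.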


Weird that you aren't getting banned. I just did a test, and I was only able to get about 20k users before they banned me and suggested that I use the lite high scores.

~M


Couldn't you continue from where you left off? Give it 10 minutes, then restart at the point where you stopped?

Yeah, I realized that I could just make it sleep for a while if it doesn't find any users. But, I'm on another project now, so that'll happen some other time.

~M
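
The sleep-and-resume idea could be sketched in Python along these lines. Everything here is hypothetical: fetch_page stands in for whatever function downloads and parses one list page, the 10-minute pause is the figure suggested above, and max_pauses is an arbitrary give-up limit.

```python
import time


def scrape_with_pauses(fetch_page, start_rank=1, pause_seconds=600,
                       max_pauses=3, sleep=time.sleep):
    """Yield (rank, users) pairs, pausing and retrying the same rank
    whenever a page comes back with no users (for example, after a ban).

    `fetch_page(rank)` is a caller-supplied function returning a list of
    usernames. The generator gives up after `max_pauses` consecutive
    empty results, since an empty page may also mean the end of the list.
    """
    rank, pauses = start_rank, 0
    while True:
        users = fetch_page(rank)
        if users:
            pauses = 0
            yield rank, users
            rank += len(users)      # resume right after the last user seen
        elif pauses < max_pauses:
            pauses += 1
            sleep(pause_seconds)    # wait, then retry the same rank
        else:
            return
```

Because the current rank is the only state, writing it to disk after each page would also let the script pick up where it left off after a full restart.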


You could try going with users in each of the skill lists, I'd bet you'd get closer to 3-4 million unique users than just the top 2 million skill totals will provide. Of course you'd have to merge your lists, which may or may not be a problem.



Of course you'd have to merge your lists, which may or may not be a problem.

If you work it down to a text list of names for each skill, GNU sort's -u flag removes duplicate values and -o lets you specify an output file.

sort -u [list1] [list2] [list3]... -o mergedlist

 

I don't foresee any problem with mpm's database, just add all the users and:

SELECT DISTINCT username FROM highscores;

(or some such)
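
If the per-skill lists are already in memory rather than in files or a table, the same merge can be sketched in a few lines of Python. The function name is made up, and names are compared as exact strings after stripping whitespace, so names that differ only in case would be kept as separate entries.

```python
def merge_unique(*name_lists):
    """Merge several username lists into one sorted list without duplicates."""
    merged = set()
    for names in name_lists:
        # Strip newlines and surrounding spaces, and skip blank lines.
        merged.update(n.strip() for n in names if n.strip())
    return sorted(merged)
```

For example, merge_unique(attack_names, defence_names) with two hypothetical per-skill lists returns each name once, much like sort -u does for files.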

 

 

 

   $dbh = DBI->connect("DBI:mysql:database=runescape_stats;host=localhost", 'trent', 'password');

I'm on another project

Is it a password generator, by chance? :D

I've made a million password generators :P

 

Edit: Ohhh, I just got what you meant. That's not my real password :P

~M
