das1330 Posted July 18, 2012 I am currently working on a project that requires me to pull a large amount of information directly off a website. All the info is in a standard format, but it's not delimited into a table or anything. Does anyone know of a program or other method to efficiently catalog this information into a table?
obfuscator Posted July 19, 2012 Could you provide the website? "It's not a rest for me, it's a rest for the weights." - Dom Mazzetti
Corvus Posted July 19, 2012 Well, it all depends on the source and the amount of data you are trying to collect. The easiest way to gather the data would be an external API, if one exists, provided by the site for that specific purpose. If that fails, you could use any number of programming or scripting languages to automate the process of accessing and parsing the web pages and formatting the output file. (I'm not sure which programming languages you know, if any, so I will just suggest searching Google for how to parse HTML with the language of your choice.) If it is a fairly small amount of data, you may save time by just doing it manually. Thanks to DrCue at DeviantArt for the signature source
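For the "formatting the output file" step mentioned above, here is a minimal Java sketch of turning one scraped record into a CSV row that a spreadsheet or database import can read as a table. The class name `CsvRowSketch`, the column names, and the sample values are all made up for illustration; nothing here comes from the actual course catalog. It only handles the usual quoting rule (wrap a field in quotes if it contains a comma, quote, or line break, and double any embedded quotes).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: formats scraped fields as one CSV line.
class CsvRowSketch {

    // Quote a field only when it contains a character that would break the row.
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        // Embedded double quotes are doubled, then the whole field is wrapped.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Join the escaped fields with commas to form one row.
    static String toCsvLine(List<String> fields) {
        return fields.stream()
                .map(CsvRowSketch::escape)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Header row followed by one invented record.
        System.out.println(toCsvLine(Arrays.asList("Course", "Term", "Description")));
        System.out.println(toCsvLine(Arrays.asList("Example 101", "Fall", "Intro, part \"A\"")));
    }
}
```

Writing one line per course to a file gives a table that most tools can import directly, which may be enough if the data only needs refreshing occasionally.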
obfuscator Posted July 19, 2012 Whatever you do, don't use Regex :P
das1330 (Author) Posted July 19, 2012 I need the database for a college schedule builder I'm working on. The website I need to pull from can be found here: http://www.registrar.fas.harvard.edu/fasro/courses/index.jsp?cat=ugrad&subcat=courses (click on each individual department name for the list). Ideally, it would pull down the description, term offered, and other relevant information as listed. No API is available to me.
obfuscator Posted July 19, 2012 What programming language are you using? Basically, you should use an XML parser to grab this data. In C#, for instance, you could use LINQ to XML.
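As a rough Java counterpart to the XML-parser suggestion, the sketch below uses the JDK's built-in `DocumentBuilder` and `XPath` classes to pull two fields out of each course entry. The markup in `SAMPLE`, the `course` class attribute, and the class name `CourseXmlSketch` are invented for illustration; the real catalog pages will be structured differently. It also assumes the input is well-formed XML. Real-world HTML often is not, in which case a strict XML parser will throw and a more tolerant HTML parser would be needed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: extracts (title, term) pairs from well-formed markup.
class CourseXmlSketch {

    // Invented sample markup; the real pages will use different tags.
    static final String SAMPLE =
          "<div>"
        + "<p class=\"course\"><b>Example 101</b> <i>Fall term</i></p>"
        + "<p class=\"course\"><b>Example 102</b> <i>Spring term</i></p>"
        + "</div>";

    static List<String[]> extract(String xml) throws Exception {
        // Parse the string into a DOM tree (fails on markup that is not well-formed).
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // Select every element marked as a course entry.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList entries = (NodeList) xpath.evaluate(
                "//p[@class='course']", doc, XPathConstants.NODESET);

        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < entries.getLength(); i++) {
            // Relative XPath expressions are evaluated against the current entry.
            String title = xpath.evaluate("b", entries.item(i));
            String term = xpath.evaluate("i", entries.item(i));
            rows.add(new String[] { title, term });
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        for (String[] row : extract(SAMPLE)) {
            System.out.println(row[0] + " | " + row[1]);
        }
    }
}
```

Walking a parsed tree like this is also why the regex warning above is worth heeding: the selectors follow the document structure, so they tend to survive whitespace and attribute-order changes that would break a pattern match.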
das1330 (Author) Posted July 19, 2012 I was thinking Java, since I'm proficient in it, but I'm still at a fairly early stage of planning at this point. Note that I don't need the system to dynamically update - the course catalog is only altered infrequently, so manual updates should not be problematic.