I quite often have to add new films into the database of my website british-film-locations.com. As well as adding the film I have to add data about the director and actors in the film as well. This is quite time consuming, so I created a system to simplify the task.
Firstly, I go to the page for the film on the Internet Movie Database, then I click on a specially-written bookmarklet. This runs some Javascript code which takes the IMDb URL, strips out the ID number of the film, and sends it to a PHP script on my website.
The PHP script uses the ID number to scrape the IMDb site to get the data it needs, such as the film title, year of release, director’s name and ID, the actor’s names and IDs, etc. It then populates a form with all this data, and all I have to do is add any extra info, then type my password and click ‘Go’ and it adds the film, actors, and director to the database for me automatically. Brilliant!
Recently I decided to start an online database for all my ZX Spectrum games. I thought I could do something similar using data from the World Of Spectrum website’s games database.
Now, the World Of Spectrum used to have an API to allow you to get data in XML or JSON format, but it hasn’t worked for a while. So I just wrote something like the system above to send the ID from the URL of the required game to a PHP script which scrapes the data directly from the site. Unfortunately this didn’t work. The World Of Spectrum website obviously has a system in place to stop you scraping their site.
It probably isn’t too difficult to circumvent this restriction – I imagine it would involve getting my PHP script to spoof its User Agent. But then a thought struck me: I already have the page with all the game’s data open in my web browser. Rather than sending info to a PHP script which then opens the page again to get the data, why not just scrape the data from my web browser window using Javascript and send it directly to the PHP script instead?
So that’s what I did!
The site doesn’t really give you much to go on in terms of finding the required data in the document (here’s an example), so I basically just cycle through every tag on the page looking for recognisable text and then get the contents of the tag which is a particular offset away. A bit crude, perhaps, but it works.
It then stuffs the data onto the end of the PHP script’s URL as a query string, and opens it in a new window.
The code could be optimised for size, but all modern browsers allow bookmarklets of 2000+ characters, so there’s no real point.
An extra advantage of doing it this way is that you can scrape dynamically generated content as well. As long as the data you want is in the DOM when you click on the bookmarklet, it can be scraped.
Something tells me that there’s a really cool application for this, I just need to think of one!