Scrape IMDB MyMovies with PHP

I have been wanting to mention the movies I have recently seen, and what I thought of them, on this blog, but was not sure how best to do it.
For a while I was running my own custom code to add movie data and ratings and to display them, but this was too cumbersome.
Recently I discovered that the movie rating system at IMDB.com keeps track of all the movies you have ever rated (IMDB.com is by the way an amazing resource for movie lovers and I hope that you have used it before).
Since IMDB is the oracle of movies, a better approach seemed to be to continue reviewing and discussing movies there and to mash the data into my blog.
A few regular expressions later and a Pick IMDB MyMovies with PHP (PIMP) script was born.
The script scrapes a given IMDB MyMovies list and provides the results in a handy two-dimensional array. It is then up to the user to write the display code to make it fit in with a web page (blog).
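To give an idea of what such display code could look like, here is a minimal sketch. The array keys (`title`, `year`, `vote`) and the `renderMovies` function are my own assumptions for illustration; the keys PIMP actually returns may differ, so adjust them to match.

```php
<?php
// Hypothetical shape of the result; the real key names in PIMP may differ.
$movies = [
    ['title' => 'Example Movie', 'year' => '2005', 'vote' => 8],
    ['title' => 'Another <Example>', 'year' => '2006', 'vote' => 6],
];

// Hypothetical helper: turn the two-dimensional array into an HTML list.
function renderMovies(array $movies): string
{
    $html = "<ul>\n";
    foreach ($movies as $movie) {
        $html .= sprintf(
            "  <li>%s (%s): %d/10</li>\n",
            htmlspecialchars($movie['title']), // escape scraped text before output
            htmlspecialchars($movie['year']),
            $movie['vote']
        );
    }
    return $html . "</ul>\n";
}

echo renderMovies($movies);
```

Escaping with `htmlspecialchars` is worth keeping even in a quick sketch, since the titles come from a scraped page.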
A basic file-based cache has been implemented to reduce the number of hits on IMDB.
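The idea behind a file-based cache can be sketched roughly as follows. This is not the code from PIMP itself; `cachedFetch` and its parameters are hypothetical names, and the sketch assumes the cache file's modification time is a good enough freshness check.

```php
<?php
// Hypothetical helper, not taken from PIMP: return the cached copy if it is
// still fresh, otherwise call $fetch and store the result.
function cachedFetch(string $cacheFile, int $timeout, callable $fetch): string
{
    // A negative timeout bypasses the cache completely.
    if ($timeout >= 0 && is_file($cacheFile)
        && (time() - filemtime($cacheFile)) < $timeout) {
        return (string) file_get_contents($cacheFile);
    }

    $data = (string) $fetch();

    if ($timeout >= 0) {
        // LOCK_EX avoids half-written files when two requests overlap.
        file_put_contents($cacheFile, $data, LOCK_EX);
    }
    return $data;
}
```

A call might look like `cachedFetch('/path/outside/webroot/mymovies.cache', 3600, $fetchFunction)`, where the path and the one-hour timeout are placeholders.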
At the core of the script is the following regular expression:
/<a href=\"\/title\/([^\/]*)\/([^>]*)>([^<]*)<\/a> \(([0-9]*[\/I]*)\)( \(.\))?<\/td>([^<])<td align=\"center\" bgcolor=\"\#ffffff\">([0-9]{1,2})<\/td><td align=\"center\" bgcolor=\"\#ffffff\"> ([0-9]\.[0-9])?/i
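Here is a rough sketch of how the expression can be applied with `preg_match_all`. The HTML row is made up to fit the pattern (it is not copied from IMDB), and the key names, including my reading of group 7 as the vote and group 8 as the site rating, are assumptions.

```php
<?php
// The pattern from above, in a single-quoted string so the backslashes
// reach PCRE unchanged.
$pattern = '/<a href=\"\/title\/([^\/]*)\/([^>]*)>([^<]*)<\/a> \(([0-9]*[\/I]*)\)( \(.\))?<\/td>([^<])<td align=\"center\" bgcolor=\"\#ffffff\">([0-9]{1,2})<\/td><td align=\"center\" bgcolor=\"\#ffffff\"> ([0-9]\.[0-9])?/i';

// Made-up sample row. Note the single newline between the first two cells,
// which the ([^<]) group expects.
$html = '<td><a href="/title/tt0000001/">Example Movie</a> (2005)</td>' . "\n"
      . '<td align="center" bgcolor="#ffffff">8</td>'
      . '<td align="center" bgcolor="#ffffff"> 7.5</td>';

// PREG_SET_ORDER gives one array per matched movie row.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$movies = [];
foreach ($matches as $m) {
    $movies[] = [
        'id'     => $m[1],        // title id from the link
        'title'  => $m[3],
        'year'   => $m[4],
        'vote'   => (int) $m[7],  // assumed: the list owner's vote
        'rating' => $m[8] ?? '',  // assumed: the site rating, may be absent
    ];
}

print_r($movies);
```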
Be aware that if the layout of the IMDB page changes, PIMP will fail. Of course I will be quick to update the regular expression since I am using PIMP myself.
There are some rumours that IMDB will introduce an open API that will allow developers to retrieve all kinds of movie information. Until that is done, crude HTML scraping techniques will have to do.
Feel free to suggest regular expression or code optimisations.
Download PIMP v1.2 and let me know what you think!
Usage:
——
0. Create an account with IMDB and mark a MyMovies list as public
1. To be able to use a cache file, create a directory that is writable by the script process (may need chmod 775 or 777). For security reasons this directory should be OUTSIDE of the area accessible from the web.
2. Configure the script with your details (list id, cache directory, etc.)
3. Upload the myImdbMovies.php file to your web server
4. Access the script directly or better yet use it as an include
Update 26/06/2006
Added a raw HTTP method in case file_get_contents doesn't work for you (hosting provider restriction).
A cache timeout of -1 will skip caching altogether and avoid possible cache file permission problems.
A listItems value of -1 will result in all items being returned.
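For the curious, a raw HTTP fetch can be sketched along these lines. This is only an illustration of the general technique, not the code shipped in PIMP; `rawHttpGet` and `splitHttpResponse` are hypothetical names. It assumes plain HTTP on port 80 and sends an HTTP/1.0 request so the server does not reply with chunked encoding.

```php
<?php
// Hypothetical helper: split a raw HTTP response into [headers, body].
function splitHttpResponse(string $response): array
{
    $parts = explode("\r\n\r\n", $response, 2);
    return [$parts[0], $parts[1] ?? ''];
}

// Hypothetical helper: fetch a page over a plain socket instead of
// file_get_contents. Returns an empty string if the connection fails.
function rawHttpGet(string $host, string $path, int $timeout = 10): string
{
    $fp = @fsockopen($host, 80, $errno, $errstr, $timeout);
    if ($fp === false) {
        return '';
    }

    // HTTP/1.0 with "Connection: close" lets us simply read until EOF.
    fwrite($fp, "GET $path HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n");
    $response = stream_get_contents($fp);
    fclose($fp);

    [, $body] = splitHttpResponse((string) $response);
    return $body;
}
```

The sketch ignores redirects and status codes, so treat it as a starting point only.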