No API? No Worries.

Copying and pasting: not the way to get people to use your site.

When Munzi Codes and I first made the Bechdelerator during a hackathon a while ago, we needed to get movie scripts so we could analyze them.

Unfortunately, we soon discovered that there was only one real source for them, and it was a bit of a pain to work with. Here is a screenshot of where the scripts came from:


There was no API, no easy way to predict what the links to the scripts would be, and no apparent way for users to quickly get a script for a specific movie. Since we had a limited amount of time, we focused on what would be more impressive for the hackathon: analyzing the movie scripts, making a graph with D3 that represented conversations between characters in the movie, and predicting whether the movie passed the Bechdel test (for more information on this test and what we made, check out the link above to the site).

The only way for people to use our site was to manually copy and paste the script from the website. This was clearly a less-than-ideal solution, and even our own friends and family did not want to use the site.

A better alternative: scrape the web for the information you need

One of my favorite things about Node.js is the many awesome libraries written for it. When we finally decided that copying and pasting needed to end and that we wanted to get the information directly from IMSDb, we quickly settled on Cheerio to scrape the information we needed from the site and the Request library to make requests to it.

Cheerio is awesome because it allows you to essentially write jQuery and use CSS selectors to get content from a website.

For example, here is part of the HTML from the IMSDb website:


<td valign="top">  
    <h1>All Movie Scripts on IMSDb (A-Z)</h1>
        <a href="/Movie Scripts/10 Things I Hate About You Script.html" title="10 Things I Hate About You Script">10 Things I Hate About You</a>(1997-11 Draft )
        <br><i>Written by Karen McCullah Lutz,Kirsten Smith,William Shakespeare</i>

To get the data we need, after installing and requiring the Cheerio and Request node modules (you can do this easily with npm install cheerio and npm install request), we can do things like this:


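// allMovies is the URL of the page listing every script (error handling omitted for brevity)
request(allMovies, function(err, response, html){
        // Parse the downloaded page so we can query it with jQuery-style selectors
        var $ = cheerio.load(html);
        // The <h1> with this text sits inside the element that holds every script link
        var movieTable = $("h1:contains(\"All Movie Scripts\")").parent();
        // .map() returns a Cheerio object; .get() turns it into a plain array
        var allMovieTitles = movieTable.find('a[href^="/Movie Scripts"]').map(function(){
            return $(this).text();
        }).get();
        res.render('./index.html', {allMovieTitles: allMovieTitles});
});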

The code above first loads all of the HTML from IMSDb using the Request library. We then need to get the table cell that contains all the links to movie scripts. Since we can see from the HTML of the page that this cell has a child with the text "All Movie Scripts", we can use

$("h1:contains(\"All Movie Scripts\")").parent();
to get the table cell element.

Next, since every movie script link has a URL that starts with '/Movie Scripts', we can use

$('a[href^="/Movie Scripts"]')
to get an array-like object of every <a> element that links to a movie script. Note that while this object is array-like and we can use .map on it, it is not actually an array, which is why the result has to be converted to a real array (in Cheerio, calling .get() on the result of .map does this).

Now that we have the list of all movie scripts, we can render a page from this data with a template engine (we used Swig).

Once a user selects a script from our list, we use the script's URL and the Cheerio library to get the text of the script, and then analyze it with our algorithm.
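
The links on the listing page are relative and contain spaces, so they need to be turned into full URLs before they can be requested. Here is a rough sketch of that step using Node's built-in URL class. The base address and the helper name resolveScriptUrl are assumptions for illustration, not code from our project.

```javascript
// Assumed base address of the site; change it if the site lives elsewhere
const BASE_URL = 'https://imsdb.com';

// Hypothetical helper: turn the relative href taken from a script link into
// an absolute URL. The URL class percent-encodes the spaces in the path.
function resolveScriptUrl(href) {
  return new URL(href, BASE_URL).href;
}

const href = '/Movie Scripts/10 Things I Hate About You Script.html';
const fullUrl = resolveScriptUrl(href);

console.log(fullUrl);
// https://imsdb.com/Movie%20Scripts/10%20Things%20I%20Hate%20About%20You%20Script.html
```

The resulting string can be handed to the same request call used for the listing page, and the HTML that comes back can be loaded into Cheerio in the same way.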

Cheerio makes crawling HTML pages very simple since you can just use CSS selectors (just like you do in jQuery) to select the page elements you want. It is a great option when working with data sources that have no API or other way to easily access the information you need. I definitely recommend checking it out and adding it to your next project.
