You heard the Gospel. This is the decade of Artificial Intelligence. You picked up Machine Learning (that's what we insiders call it, right?) as a hobby, you're prepping your next career move, or you're a young scientist seriously thinking about leaving academia to go be successful in the new economy. You've heard many times about Kaggle, the de facto king of ML competition platforms.
So you signed up for a competition all pumped up, aiming for a top-tier position. A bronze medal at least. You've gone through all the online courses. You're ready for the big leagues. You know in your heart that your cleverness and skills will prevail, that you will find that magic feature, or get that angle on the problem that nobody else got. After all, you are different. Your EDAs are magical; you read the forum posts and absorbed all their wisdom.
You worked your ass off, but the competition's deadline is dreadfully close now and you don't have the magic bullet yet. You've spent copious amounts of time sparring with memory efficiency (8 GB of RAM here), and no matter how well you built your model, the run time on your machine never seems good enough. As the competition advanced, you found yourself more and more often staring at your screen, waiting for your model to finish running, trying not to get too distracted yet looking for ways not to waste your time. 'There's too much work left to waste a day or two learning how to run my models on the cloud,' you thought.
Then, before you know it, the competition ends and the hammer drops. You did not make it to the top 10; more like the top 1000… It's 5am, you've barely slept for the last 3 days. You felt the burn.
The next day, you eagerly read the top solutions, kindly shared by amazing kagglers. You realize your solution was more or less what the top 10 did, except that they had over 128 GB of memory and computing beasts with a gazillion cores and shiny, expensive GPUs. You feel the burn again. But that is not a bad thing, believe me. It's a necessary evil.
AWS comes to the rescue! There is no need to despair, as we live in the age of the cloud. Let me guide you through a few simple steps, and within the hour (or 10 minutes if you already have an AWS account), you can let your models run on a 16-core, 64 GB machine for only $0.70 an hour. Now you will really be able to focus on methodology (the real secret of top kagglers).
So fasten your seat-belt, you’ll have your python script running on an EC2 machine before you can say ‘Whaaa?’.
Note: some online education entities provide AWS educate credits on some of their Data Science courses. Make sure you claim them if that is the case.
So, are you ready? Let's go.
Creating an AWS account is free, and it comes with a copious amount of monthly free usage for services qualifying for the Free Tier.
I will assume that you have registered an AWS account; it should be rather quick and straightforward. I will focus on making sure that your AWS account stays in a friendly relationship with your bank account. From experience I can tell you that no matter how many times you've heard that you must be consistent in terminating your instances, there will be times when you'll forget. That is why you should use their new(ish) tools to manage a budget and monitor your usage.
- Create & set-up your account. Take some time to learn about the AWS Free Tier benefits.
- Go to the region selection tool and pick 'Oregon'. It will make transfers of datasets from Kaggle quick and easy. The available instance types are also very good. If you were to use more local data sources, it would make sense for you to pick another region; otherwise, don't bother.
- Go to Services on the top left of the console and select Compute>EC2.
- You will then find a summary of your Elastic Compute situation. Click on the blue button called 'Launch Instance'.
- Go to the Community AMIs (Amazon Machine Images) and type "git conda" in the search box. AMIs are pre-configured systems, and since this is our first time doing this, we'll source one that can get us going pretty quickly with running our ML Python scripts. You should have gotten one result, as shown below. It has Python 3.6, Anaconda, git and docker. If not, check your region. We won't get into docker here, but within a few lines of code we should be ready to run XGB, LightGBM, sklearn…
- Pick an instance type. There are many of them, but my favorite is the m5.4xlarge. With 16 cores and 64 GB of memory for $0.70/h, it has, for me, the best performance-to-cost ratio. Of course you should pick the instance most adapted to your needs. Pricing is not shown on this page, so I'll put the link here for easy access. If you feel insecure at this point, select the t2.micro instance; it is included in your Free Tier.
- Click on Review & Launch. Check that all is as you desire (there shouldn't be any surprises here).
- Click on Launch. Your instance is almost ready. All that remains is to generate a key pair to securely communicate with the EC2 instance. The following window will appear:
- Select "create a new key pair" and type a name for your private key. Download your key into your project's folder. It will be used to identify you as the owner of the instance. It is crucial that this key isn't found by someone ill-intentioned, so put it somewhere sensible enough to use later, yet specific enough not to get lost/stolen (not your desktop, for example). Hidden folders are reasonably good for that, but your project folder is most likely good enough for now.
- Click on Launch Instance, and then View Instances. We will need the instance's address to connect to it: select your instance and click on the "copy to clipboard" button that appears when you hover over the "Public DNS (IPv4)" field.
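For reference, once you're comfortable with the console flow above, the same launch can be scripted with the AWS CLI. This is only a sketch: the AMI id and key name below are placeholders, not values from this walkthrough, and it assumes you have installed and configured the CLI with your credentials.

```shell
# Launch one instance of a chosen AMI with your key pair (placeholders):
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type m5.4xlarge \
    --key-name my-key --count 1
# Look up its public DNS name once it's running:
aws ec2 describe-instances
```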
At this point your instance is running. We just have to make it do what we want!
For this section, I will give instructions for Windows users, as Windows is not UNIX-based and is therefore a little trickier as far as server communication goes. If you are on macOS or Linux, things should be simpler, though more command-line heavy.
To connect to the instance, we need to use SSH, a secure protocol for computer communication. Think HTTPS, but on the command line (check this article for a simple introduction to them). In any case, though your browser handles HTTPS seamlessly, Windows is not equipped with an SSH client. We'll need a little piece of software that hasn't been dethroned since its release in 1999: PuTTY. You can download it here. Install the software.
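If you're on macOS or Linux, you can skip PuTTY entirely and use the built-in OpenSSH client. A minimal sketch, where the key file name and hostname are placeholders for your own:

```shell
# Stand-in for the .pem key you downloaded from AWS (name is hypothetical):
touch my-key.pem
# ssh refuses private keys with loose permissions, so lock the file down first:
chmod 400 my-key.pem
# Then connect; replace the hostname with your instance's Public DNS (IPv4):
#   ssh -i my-key.pem ubuntu@ec2-XX-XX-XX-XX.us-west-2.compute.amazonaws.com
```

The `.pem` file works as-is here; the `.ppk` conversion described below is only needed for PuTTY.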
- The first thing we’ll do is convert the private key file from .pem to .ppk, a format PuTTY understands. Launch PuTTYgen and load your private key. (Load button or File>Load private key) Make sure you select All Files (*.*) to see the .pem file you got from Amazon.
- Use a passphrase to secure your key and save your private key (save private key button or File>save private key). You may now delete the .pem file as your key is now safely stored in the .ppk file PuTTY will use.
- Launch PuTTY. You will land on its config window. In 2 easy steps we’ll be working remotely on our AWS instance. First, go to the Connection>SSH>Auth menu item and load your .ppk file.
- If you later experience disconnections, you may go to Connections and change the seconds between keepalives parameter.
- Then go back to Session and enter the information about the instance. In Host Name, enter: ubuntu@my_instance_ip_address (if you picked a non-ubuntu AMI earlier, get your username on this page). Make sure the default Port: 22 and connection type: SSH are selected. Important: Save the Session configuration, we will need it again later. Here I decided to call it “My_new_AWS_instance”.
- Click on Open. PuTTY will ask you if you trust the remote machine you're about to connect to. Say yes. It should then take a few seconds to authenticate you before dropping you at the console prompt. You are now in control of your instance.
Amazing! You’re at home on your AWS boosted machine. Now let’s get cozy and install the packages we need.
I picked this AMI because, unless you're doing Neural Networks, it's the closest one to what we need. Anaconda is installed, so adding new packages will be a breeze. Run
conda list
to see all installed packages. You will find some familiar data science libraries:
- numpy, pandas and scipy
- ipython, jupyter and jupyter lab
- matplotlib, seaborn and bokeh
- scikit-learn and scikit-image
- sqlite and sqlalchemy
But the kings of Kaggle, XGB and LightGBM, aren't there. Let's install them and add the runner-up CatBoost to the mix. Run the following commands:
conda install -c conda-forge xgboost
conda install -c conda-forge lightgbm
conda install -c conda-forge catboost
To compete on Kaggle, we need the datasets. Thankfully, Kaggle has a great API that will allow us to download our datasets to the EC2 instance via the command line. The cherry on top is that it will also allow you to submit predictions automatically. But first, you'll need an API key from Kaggle, so it can identify you when the EC2 instance makes requests.
- Go to My Account on Kaggle.
- Scroll down and click on create new API token.
- Save the kaggle.json file in your project’s folder. This is your API key. Just like the SSH key before, it is to be kept secret.
- Download and install WinSCP. It will make sending local files to the instance really easy. Don’t forget to donate if you end up enjoying it.
- Launch it. If you saved your PuTTY session as I asked you to, you're about to thank me. On the login screen, click on Tools>Import sites. Select the session you saved, and then log in.
- You now have a browser with local files and remote files side by side. Locate the kaggle.json file that contains your Kaggle key. Right-click on the right panel (the remote folders) and create a new folder called ".kaggle". The dot makes it a hidden folder, for increased safety.
- Click on Open directory (ctrl+o) and type the name of the folder you just created. Upload kaggle.json there (using F5 or drag&drop).
- In PuTTY, run
chmod 600 /home/ubuntu/.kaggle/kaggle.json
to restrict the key's permissions so that only your user can read it (chmod doesn't encrypt anything, but it keeps the file away from prying eyes). It is now safely stored. Also run
pip install kaggle
You are now ready to use the API. Read about its commands here.
- To download your competition's files, run
kaggle competitions list
to get your competition's name. Highlight it (the Linux equivalent of ctrl+c). Now type
kaggle competitions download -c my_competition -f data_file
right-clicking to paste your competition's name (the equivalent of ctrl+v) and then typing the name of the file you want to download.
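Putting the steps above together, here is a sketch of the whole API setup as one script. The paths assume the ubuntu user's home directory, and the competition and file names at the end are placeholders for your own:

```shell
# Put the API key where the kaggle client looks for it:
mkdir -p "$HOME/.kaggle"
if [ -f "$HOME/kaggle.json" ]; then
    mv "$HOME/kaggle.json" "$HOME/.kaggle/"
fi
# Lock the key down so only your user can read it:
if [ -f "$HOME/.kaggle/kaggle.json" ]; then
    chmod 600 "$HOME/.kaggle/kaggle.json"
fi
# With the key in place, install the client and fetch your data:
#   pip install kaggle
#   kaggle competitions list
#   kaggle competitions download -c <competition-name> -f <data-file>
```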
That is it, my friend. At this point, you can run your script from the command shell. Before you leave though, let's be safe and set up budget limits on your account.
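One last tip: a long run will die if your SSH session drops, so launch it with nohup. A minimal sketch, using a stand-in script (your real entry point, e.g. a hypothetical train.py, goes in its place):

```shell
# Stand-in for your real training script:
printf 'print("training done")\n' > train.py
# Run it detached: it keeps going even if the SSH connection closes:
nohup python3 train.py > train.log 2>&1 &
wait  # in practice you'd just log out here; we wait so we can inspect the log
# Check progress any time by reading the log:
cat train.log  # should contain: training done
```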
That's why you should really not skip the following steps; otherwise, the bill will come back to haunt you! So follow me.
- Go to your Billing Dashboard > Budgets. Set up a monthly budget. It will notify you when your costs reach a certain threshold.
- No, really, do it. Better safe than sorry. If you are planning on using it a bunch, think about enabling CloudWatch, and setting up its billing alarms.
- Also, download the AWS mobile app so you can easily check what’s running from anywhere.
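If you end up installing the AWS CLI, the same safety net is one command away. A sketch, where the instance id is a placeholder for your own:

```shell
# Stop the instance (pauses compute billing; the disk is kept and still billed):
aws ec2 stop-instances --instance-ids i-xxxxxxxx
# Or terminate it entirely when the competition is over:
aws ec2 terminate-instances --instance-ids i-xxxxxxxx
```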
I hope that this will allow you to spend your time on what's important: reading articles, testing new methods and building a strong methodology. Now go get those medals!