For my capstone project at Flatiron School, I worked on an NLP project. It brought up a few hardware issues that forced me to find creative solutions, as the only alternative was to spend money on a new computer.
The first issue arose when I decided to keep all my files, including the source dataset, in the same place. As GitHub was (and still is) hosting my project, I soon learnt that GitHub limits the size of files.
The second hurdle appeared when I had to merge my dataset back together. All the work so far had been done on my ten-year-old laptop and, although computing times could be rather long, everything was fine. But the concatenation step was a step too far for my “vintage” hardware.
The last problem stemmed from using XGBoost. The training step is computationally intensive, and without a GPU, it takes quite a while!
File Size Limitation (on GitHub)
There is a soft 50 MB limit, and a hard 100 MB limit on files allowed in repositories (see GitHub Docs).
The easiest solution to upload large files on GitHub is setting up Git Large File Storage. It is an excellent addition to your workflow, and it will allow you to store files up to 1 GB in your repository. (read the details here, terms are a little bit more subtle than this)
Git LFS setup
- Install Git LFS.
- Check everything is OK by running git lfs install.
- cd to your local repository.
- Run git lfs track [big_file] to tell Git LFS which file to manage.
- Git LFS lists all the tracked files and associated options in a .gitattributes file, so make sure it is tracked by git with git add .gitattributes.
- git add [big_file].
- git commit -m "start tracking big_file".
- git push origin main.
That is it! Now you can use git as you ordinarily would! Everything happens under the hood, and git LFS will take over when needed.
When you clone a repository with LFS files, run git lfs pull after git clone [repo] to retrieve the large files. In the repository itself, Git LFS replaces each large file with a small text pointer file listing the information necessary for LFS to retrieve your file.
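For illustration, such a pointer file is only a few lines long and looks roughly like this (the oid hash and size below are made up and abbreviated):

```
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a2146...
size 133466
```

The actual file content lives on the LFS server, and git lfs pull swaps the pointer for the real file.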
So with this, we keep things simple and have all the files in the same place! Let us look at the next problems:
CPU and RAM Limitation
As said earlier, I managed to (patiently) do most of the work on my laptop. I tried the chunksize= parameter of pandas, and dask, to save some time. Overall, it was not too bad, even without these “accelerators”. But when I needed to stitch the prepared data back together to start training the model, the notebook kernel kept dying. I later realised that this last preparation step alone requires about 45 GB of RAM, and then I had to train an XGBoost model on that dataset... I needed more computing power! As I couldn’t justify spending thousands of pounds on a new (decent) machine, I had to find something else:
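As an aside, the chunksize= trick mentioned above can be sketched like this. A tiny in-memory CSV stands in for the real dataset, and the column names and filtering step are made up for illustration; the point is that pandas yields the file one chunk at a time instead of loading it all at once.

```python
import io
import pandas as pd

# Small stand-in for a large on-disk CSV (the real call would be
# pd.read_csv("big_file.csv", chunksize=...)).
csv_data = io.StringIO("text,label\n" + "\n".join(f"row {i},{i % 2}" for i in range(10)))

# chunksize= makes read_csv return an iterator of DataFrames,
# so only one chunk needs to fit in memory at a time.
chunks = pd.read_csv(csv_data, chunksize=4)
processed = [chunk[chunk["label"] == 1] for chunk in chunks]
result = pd.concat(processed, ignore_index=True)
print(len(result))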
Google Cloud Platform
I chose Google because they offer “free credit” to experiment with their products, and it turns out this $300 credit actually goes a long way (as long as you do not let a VM run when it is not in use). But Microsoft Azure or Amazon Web Services would have done the trick as well.
Creating an Account
Create a new project
Once in the Console, you will notice the banner mentions “My First Project”, the one created by default. You can always change it if needed, but it is not necessary right now.
Create a Notebook
Things are getting real now! It is time to create your first Virtual Machine, or instance in GCP jargon. It will cost you money, so I wouldn’t advise going for a 60-CPU machine with 11 TB of RAM just yet. Spend those $300 wisely!
Head to the hamburger menu and go all the way down the list, to AI Platform > Notebooks > New Instance.
Google has made things easy for us, and we have access to several off-the-shelf machines. You will be able to modify the configuration later on, so there is no need to stress over this step too much.
What you should keep in mind, though, is that when a notebook/instance is running, it costs you money. It is all well and good while you still have the opening credit, but we all know someone who was charged thousands of dollars for not stopping an instance, so be cautious!
Press the START button, then open JupyterLab. Open a terminal window (File > New... > Terminal) and run pip list; this command will tell you which packages are available on your VM. If you need other packages, run !pip install [library] in a notebook, or use apt, the Debian package manager: sudo apt-get install [package].
Once you have resolved the packages, you can go ahead and clone the repositories required for your project. The easiest way to do so is to open a terminal and use git clone [repo address]. Before closing that terminal window, you might want to repeat the Git LFS setup steps if need be.
Once you have gone through those steps, the final thing to do is to update the paths to the source files in your notebooks/scripts. If you are not sure where the repository files are, open a terminal window and navigate the directory tree using ls, to list the contents of a directory, and cd [directory], to change directory.
After unleashing the power of cloud computing onto your project, it is time to save your work. In the Git menu (underlined in the following picture), you will find Git Interface, which opens the Git tab, if you prefer to use the provided git GUI. The circled icon is Git Command in Terminal, which opens a terminal in the correct folder, if you prefer to run traditional git commands from the terminal window.
One way or another, you can easily add, commit, then push your hard work back to your repository.
As I mentioned, don’t forget to stop the VM before exiting the website. This picture shows the green check, indicating the instance is up and running (the STOP button is activated). This one displays the grey square, meaning the VM is not running (the START button is activated).
If you are serious about AI, you might decide to build yourself a rig. The right hardware will depend a lot on the type of work you are doing and the libraries you are using. If that is the route you take, head to Tim Dettmers’ blog for some sound advice!
But as you can see in this brilliant post from Tim Dettmers (again), the cloud can also be a cost-effective way to do data science.
Welcome to the 21st century!