Inference of Text-Generation models
On JupyterHub
In this part of our tutorial about LLMs we will learn how to use a text-generation model from Huggingface on JupyterHub. If you have a local machine with CUDA installed, all the steps should be the same, but setting up the right environment correctly can be tedious.
Start a Jupyter notebook and select the standard kernel. Make sure you selected a GPU when starting JupyterHub. If you want access to your PALMA files, also make sure to select the PALMA storage integration. This will allow you to access the PALMA file system under /palma.
If you need to install transformers, please use pip in your default Python environment:
pip install transformers
Torch should be pre-installed (JupyterHub) or available in your module toolchain (PALMA). On a local machine it can be difficult to install CUDA and then torch within the correct environment.
Here you can check whether CUDA is available and how much VRAM you can access. This notebook is written for the use of only one GPU, usually cuda:0, but should also run on cpu.
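A minimal check could look like the following sketch, assuming torch is already available in your kernel:

```python
import torch

# Pick the GPU if CUDA is available, otherwise fall back to the CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

if torch.cuda.is_available():
    # Report the name and total VRAM of the first GPU
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GiB")
```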
The Huggingface caching mystery
Huggingface provides a very simple interface for models. The transformers package downloads them automatically, but it is easy to lose track of where the models end up on your disk, and they can be huge! So we should be careful when loading the models.
We have different options for where to load our models:
- The default Huggingface cache is the hidden folder ~/.cache/huggingface/models. But since the models are huge, this can easily burst your partitions!
- If you want to use PALMA, you can use /palma/scratch/tmp/$USER/huggingface/models/ and remove it later.
- Otherwise, for small models (!) just use a non-hidden cache directory (e.g. ~/huggingface/models) and remove it later. If you get errors here, that might be due to permissions; in that case, use the standard Huggingface cache.
- For bigger models you could also use an OpenStack usershare, e.g. /cloud/wwu1/{group}/{share}/cache.
- Or just leave it as it is, but be aware!
So remember where you stored your model; you will need the cache directory later on. We will continue loading the model into and from cache_dir = "/cloud/wwu1/d_reachiat/incubai/cache". Below you can see how to download and start the smallest Pythia model. Pythia is a collection of open-source LLMs for text generation, similar to GPT (closed source) or Llama (restricted license).
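A sketch of how this download could look, assuming the cache directory from above (the exact code in the notebook may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

cache_dir = "/cloud/wwu1/d_reachiat/incubai/cache"  # adjust to the cache directory you chose above
model_name = "EleutherAI/pythia-70m-deduped"

# Download the tokenizer and the model (or load them from the cache on later runs)
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=cache_dir)

# Move the model to the GPU if one is available
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```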
We can now proceed to prompting. That means we give the model a sentence, and it generates new text that should continue the given input in a coherent way.
As you will see, the text generated by our small Pythia model is repetitive and not very good. Changing some generation parameters can help, but the small models used here for testing purposes are not well suited for finding good parameters.
To learn more about text generation strategies, you can visit the Huggingface page about generation strategies.
You can now prompt the model in the following way:
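Here is a minimal sketch, assuming the model and tokenizer defined above; the generation parameters are example values, not necessarily the tutorial's exact settings:

```python
prompt = "My sample prompt"

# Tokenize the prompt and move the tensors to the model's device
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate a continuation of the prompt
output_ids = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_k=50,
)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```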
In order to simplify this process, we can build a pipeline. The first argument of the pipeline is the task we want to use it for, in our case text generation. The other inputs are the previously defined model and tokenizer, as well as the arguments of the model.generate function from above and our specified device.
So let’s build our pipeline:
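For example (again with illustrative generation arguments rather than the tutorial's exact values):

```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=device,        # e.g. "cuda:0", or "cpu"
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
)

print(generator("My sample prompt"))
```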
If you have some prompts in a text file, you can load that text file and use the pipeline to process them. It is more efficient to let the pipeline iterate over the data than to call the pipeline in a loop.
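A sketch of this pattern, assuming one prompt per line in the file (the path is only an example):

```python
# Read the prompts, one per line, skipping empty lines
with open("prompts/prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

# Passing the whole list lets the pipeline iterate over the data itself,
# which is more efficient than calling the pipeline once per prompt
results = generator(prompts)

for prompt, result in zip(prompts, results):
    print(prompt, "->", result[0]["generated_text"])
```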
By now we have hopefully been able to:
- download a text-generation model
- load the model from a self-defined cache
- use multiple prompts to test the model
Use a Python script
In order to move to PALMA and deploy bigger models, we need to convert everything into a Python script. You can find the scripts in incubaitor/2_PALMA/2_2_LLMs-text-generation/scripts/. You can now check whether the script runs by trying the following command in the terminal:
python pythia.py --cache_dir /cloud/wwu1/d_reachiat/incubai/cache --size 70m --prompt "My sample prompt"
You might also experiment with a prompt collection like the one in incubaitor/2_PALMA/2_2_LLMs-text-generation/prompts/prompts.txt and an output file with
python pythia.py --cache_dir /cloud/wwu1/d_reachiat/incubai/cache --size 70m --prompt_file ../prompts/prompts.txt --out_file out.csv
where you get a nice CSV of your prompts and the text generated by the model.
If all this works for you on JupyterHub, you might be interested in deploying bigger models on PALMA.
When you run the script, you will get information about the GPU memory (VRAM) usage of the model. You need to add a CUDA overhead of about 1 GB to estimate the expected memory usage; thus, the 6.9b Pythia model is too big for JupyterHub. While the pipeline is running, you can open a terminal and type nvidia-smi to check the memory usage.
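Alternatively, you can check the usage from inside Python with standard torch.cuda calls, for instance:

```python
import torch

if torch.cuda.is_available():
    # Memory currently held by tensors vs. memory reserved by the CUDA allocator
    allocated = torch.cuda.memory_allocated(0) / 1024**3
    reserved = torch.cuda.memory_reserved(0) / 1024**3
    print(f"Allocated: {allocated:.2f} GiB, reserved: {reserved:.2f} GiB")
```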
Moving to PALMA
Now, if everything went right, we want to move to PALMA. We want to use the GPU partitions there in order to run bigger models than we can in JupyterHub. At first, the gpuexpress partition is suitable for testing.
Before you try it on PALMA, first make sure that a small model is running on the JupyterHub!
If you don't know how to use PALMA, read our tutorial in 2_1_PALMA. The HPC wiki also gives a good overview of how to use PALMA.
Installing requirements
We now want to use the shell scripts in the folder incubaitor/2_PALMA/2_2_LLMs-text-generation/jobs to generate text from our models.
We use a specific so-called toolchain to be able to use CUDA. The following toolchain is suitable:
You can find this toolchain by typing module spider PyTorch. But as the login node is on a different architecture, you would need a job script as below to find the right module name and CUDA version on the target architecture.
Typing these commands on the command line shows that the last module is not available on the login node. Therefore, to install further packages, we must be inside this toolchain. To make sure that the right Python and PyTorch versions are used, we install the transformers package via pip with a job script install.sh. As we use Torch 1.10, which is a fairly old version, we take care to use a transformers version that is suitable, for instance transformers==4.33.1.
Then we can run the install script on the right architecture by using the command sbatch install.sh in the directory incubaitor/2_PALMA/2_2_LLMs-text-generation/jobs/. When the job is finished, check the output file with vi to make sure that no new torch version was installed (which might cause a lot of conflicts).
In case something went wrong, remove the installed packages in your home directory (due to the --user flag they are installed into ~/.local/; you can remove the folders there).
Prepare and run the model
Now hopefully everything went right. Check the pythia-70m-test.sh script now. If you have your model in your usershare, you should use it as the model directory in the script. If you don't have a usershare, you can copy the whole model directory to your scratch directory. For instance, on PALMA (!) use
cp -r ~/cloud/wwu1/u_jupyterhub/home/<first letter of username>/<username>/.cache/huggingface/models/models--EleutherAI--pythia-70m-deduped $WORK/incubaitor/2_PALMA/2_2_LLMs-text-generation/models/
if you used the standard Huggingface cache (see the caching section above). There is no nice way to download the Huggingface models directly, so if necessary, start a script (see above) that downloads the models to the scratch directory, even if it then does not start generating (or crashes due to limits).
Now the data should be in your scratch directory and we should be ready to run the first small model. Go back to ~/incubaitor/2_PALMA/2_2_LLMs-text-generation/jobs/ and start the job with the command sbatch pythia-70m-test.sh. Check the Slurm output file with vi slurm-pythia-test-1b-express.out to see whether everything went well. Furthermore, the output file should be on your scratch partition. You should be able to read its contents with vi /scratch/tmp/<username>/pythia-70m-express.csv, or copy the file to $WORK/transfer if you prepared the PALMA Nextcloud integration and download it via the web interface (still under development).
For further info about how to transfer data, visit the HPC documentation.
If you are happy with the results, test if you can also get the 1b version to work in the same way.
Change the script for your needs
If you want to change things in the script or test other functions of the model, you can play with the small models on JupyterHub. If resources are available, you can also start the jupyter.sh script on PALMA and experiment from your own machine. When you are ready, make these changes in the pythia.py file (the best way is to clone it into your private Git repository, make the changes, pull them to PALMA, and run the script on a small model for testing).
Then, if resources are available, you can try to run inference on a bigger model on PALMA. See the job scripts for the 6.9b and 12b models.
Llama-2 and other models
The Llama text generator is provided by Meta. To download the Huggingface version, you need to register for the Llama-2 models with Meta and obtain a Huggingface access token. After downloading this model, for instance on JupyterHub, and caching it to your usershare or scratch directory, you can access it on PALMA. The smallest model might be too big for JupyterHub and crash your kernel, but JupyterHub can still be a convenient way to download it.
Once you have that, see the llama.py script and the corresponding job script. The only change in the Python file is the model selection. This way you can adapt your script to whatever Huggingface text-generation model you want to use.
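For illustration only, the model selection could look roughly like this; the exact model name and the way the access token is passed are assumptions, so check them against llama.py and the Huggingface documentation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

cache_dir = "/cloud/wwu1/d_reachiat/incubai/cache"
model_name = "meta-llama/Llama-2-7b-hf"   # example gated model; requires approval by Meta
hf_token = "hf_..."                       # your personal Huggingface access token

# Note: older transformers versions use use_auth_token=... instead of token=...
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=cache_dir, token=hf_token)
```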
Beyond text generation
There are many other types of models available on Huggingface, and they all work with similar pipelines. You can check on the model card (top right, </> Use in transformers) how to load the model and how to build a pipeline. For the pipeline, due to the caching issues, use a similar approach as above. (Remember to set local_files_only=False when downloading the models!)
Then you need to check how to provide the pipeline with input and what the output looks like. This should also be described in the model card. For instance, the following can be used for text classification:
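Here is a sketch; the model name is just a common example and not necessarily the one from the original tutorial:

```python
from transformers import pipeline

cache_dir = "/cloud/wwu1/d_reachiat/incubai/cache"

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # example model
    device=0,                               # use -1 to run on the CPU
    model_kwargs={"cache_dir": cache_dir},  # keep the cache under control, as above
)

texts = ["I love this tutorial.", "This model is far too big for my GPU."]
for result in classifier(texts):
    print(result)   # e.g. {'label': 'POSITIVE', 'score': 0.99}
```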
You can also use multiple questions (on multiple texts) by iterating through the pipeline.
Now you can use Huggingface models! Further models for audio recognition or image recognition will need some other packages like OpenCV, which may also be available in a PALMA toolchain.