# ProFactory Frequently Asked Questions (FAQ)

## Installation and Environment Configuration Issues

### Q1: How do I properly install ProFactory?

**Answer**: You can find the installation steps in the README.md at the root of the repository.

### Q2: What should I do if I encounter the error "Could not find a specific dependency" during installation?

**Answer**: There are several solutions for this situation:

1. Try installing the problematic dependency individually:

   ```bash
   pip install name_of_the_problematic_library
   ```

2. If it is a CUDA-related library, ensure you have installed a PyTorch version compatible with your CUDA version:

   ```bash
   # For example, for CUDA 11.7
   pip install torch==2.0.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html
   ```

3. Some libraries require system dependencies to be installed first. For example, on Ubuntu:

   ```bash
   sudo apt-get update
   sudo apt-get install build-essential
   ```

### Q3: How can I check if my CUDA is installed correctly?

**Answer**: You can verify that CUDA is installed correctly with the following methods:

1. Check the GPU driver and the CUDA version it supports:

   ```bash
   nvidia-smi
   ```

2. Verify that PyTorch can see CUDA from Python:

   ```python
   import torch
   print(torch.cuda.is_available())      # Should return True
   print(torch.cuda.device_count())      # Displays the number of GPUs
   print(torch.cuda.get_device_name(0))  # Displays the GPU name
   ```

3. If PyTorch cannot detect CUDA, make sure the installed PyTorch build matches your CUDA version.

## Hardware and Resource Issues

### Q4: What should I do if I encounter a "CUDA out of memory" error during runtime?

**Answer**: This error indicates that your GPU memory is insufficient. Solutions include:

1. **Reduce the batch size**: This is the most direct and effective method. Cut the batch size in the training configuration by half or more.
2. **Use a smaller model**: Choose a pre-trained model with fewer parameters, for example a smaller ESM-2 checkpoint instead of a large one.
3. **Enable gradient accumulation**: Reduce the per-device batch size and increase `gradient_accumulation_steps` (for example, to 2 or 4); this lowers memory usage while keeping the effective batch size unchanged.
4. **Use mixed precision training**: Enable the `fp16` option in the training options, which can significantly reduce memory usage.
5. **Reduce the maximum sequence length**: If your data allows, decrease the `max_seq_length` parameter.
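If you configure training programmatically rather than through the interface, the sketch below shows how these options combine. It is a minimal example assuming a Hugging Face `Trainer`-style setup; the parameter names (`per_device_train_batch_size`, `gradient_accumulation_steps`, `fp16`) come from the `transformers` library, and the paths and values are placeholders to adapt to your own ProFactory configuration.

```python
from transformers import TrainingArguments

# Illustrative memory-saving configuration: a small per-device batch plus
# gradient accumulation keeps the effective batch size at 8 * 4 = 32,
# and fp16 roughly halves activation memory.
training_args = TrainingArguments(
    output_dir="ckpt/finetune_run",     # placeholder checkpoint directory
    per_device_train_batch_size=8,      # reduced to fit GPU memory
    gradient_accumulation_steps=4,      # compensates for the smaller batch
    fp16=True,                          # mixed precision training
    learning_rate=5e-5,
    num_train_epochs=10,
)
```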
### Q5: How can I determine what batch size I should use?

**Answer**: Determining the appropriate batch size requires balancing memory usage and training effectiveness:

1. **Start small and gradually increase**: Begin with smaller values (like 4 or 8) and gradually increase until memory is close to its limit.
2. **Refer to benchmarks**: For common protein models, most studies use a batch size of 16-64, but this depends on your GPU memory and sequence length.
3. **Monitor the training process**: A larger batch size may make each training iteration more stable but may require a higher learning rate.
4. **Rule of thumb for memory issues**: If you encounter memory errors, first try halving the batch size.

## Dataset Issues

### Q6: How do I prepare a custom dataset?

**Answer**: Preparing a custom dataset requires the following steps:

1. **Format the data**: Organize the data into a CSV file containing at least the following columns:
   - `sequence`: The protein sequence, represented using standard amino acid letters
   - Label column: Depending on your task type, this can be numerical (regression) or categorical (classification)
2. **Split the data**: Prepare training, validation, and test sets, such as `train.csv`, `validation.csv`, and `test.csv`.
3. **Upload to Hugging Face**:
   - Create a dataset repository on Hugging Face
   - Upload your CSV files
   - Reference the dataset in ProFactory using the `username/dataset_name` format
4. **Create the dataset configuration**: The configuration should include the problem type (regression or classification), the number of labels, and the evaluation metrics.

### Q7: What should I do if I encounter a format error when importing my dataset?

**Answer**: Common format issues and their solutions:

1. **Incorrect column names**: Ensure the CSV file contains the necessary columns, especially the `sequence` column and the label column.
2. **Sequence format issues**:
   - Ensure the sequence contains only valid amino acid letters (ACDEFGHIKLMNPQRSTVWY)
   - Remove spaces, line breaks, or other illegal characters from the sequence
   - Check that the sequence length is within a reasonable range
3. **Encoding issues**: Ensure the CSV file is saved with UTF-8 encoding.
4. **CSV delimiter issues**: Ensure the file uses the correct delimiter (usually a comma). You can open the file in a text editor to inspect and correct it.
5. **Missing values**: Ensure there are no missing values in the data, or handle them appropriately.
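Before importing, you can run a quick sanity check on the CSV yourself. The sketch below uses only pandas and is not part of ProFactory; the file name `train.csv` and the label column name `label` are placeholders for your own files.

```python
import pandas as pd

df = pd.read_csv("train.csv", encoding="utf-8")   # placeholder file name

# 1. Required columns (rename "label" to match your own label column).
missing = {"sequence", "label"} - set(df.columns)
assert not missing, f"Missing columns: {missing}"

# 2. Strip whitespace/line breaks, then flag non-standard amino acid letters.
df["sequence"] = (df["sequence"].astype(str)
                  .str.replace(r"\s+", "", regex=True)
                  .str.upper())
valid = df["sequence"].str.match(r"^[ACDEFGHIKLMNPQRSTVWY]+$")
print(f"{(~valid).sum()} sequences contain illegal characters")

# 3. Missing values and sequence length range.
print(df.isna().sum())
print("Length range:", df["sequence"].str.len().min(), "-",
      df["sequence"].str.len().max())
```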
### Q8: My dataset is large, and the system loads slowly or crashes. What should I do?

**Answer**: For large datasets, you can:

1. **Reduce the dataset size**: If possible, test your method on a subset of the data first.
2. **Improve data loading efficiency**:
   - Use the `batch_size` parameter to control the amount of data loaded at a time
   - Enable data caching to avoid repeated loading
   - Preprocess the data to reduce file size (e.g., remove unnecessary columns)
3. **Shard the dataset**: Split large datasets into multiple smaller files and process them one by one.
4. **Increase system resources**: If possible, add RAM or use a server with more memory.

## Training Issues

### Q9: How can I recover if training is suddenly interrupted?

**Answer**: Methods to handle training interruptions:

1. **Check checkpoints**: The system periodically saves checkpoints (usually in the `ckpt` directory). You can recover from the most recent one:
   - Look for the last saved model file (usually named `checkpoint-X`, where X is the step number)
   - Specify the checkpoint path as the starting point in the training options
2. **Use the checkpoint recovery feature**: Enable the checkpoint recovery option in the training configuration.
3. **Save checkpoints more frequently**: Adjust the checkpoint saving frequency, for example save every 500 steps instead of the default 1000.

### Q10: How can I speed up training if it is very slow?

**Answer**: Methods to speed up training:

1. **Hardware**:
   - Use a more powerful GPU
   - Use multi-GPU training (if supported)
   - Ensure data is stored on an SSD rather than an HDD
2. **Parameter settings**:
   - Use mixed precision training (enable the `fp16` option)
   - Increase the batch size (if memory allows)
   - Reduce the maximum sequence length (if the task allows)
   - Evaluate less frequently (increase the `eval_steps` interval)
3. **Model selection**:
   - Choose a smaller pre-trained model
   - Use parameter-efficient fine-tuning methods (like LoRA)

### Q11: What does it mean if the loss value does not decrease or if NaN values appear during training?

**Answer**: This usually indicates a problem with the training:

1. **Reasons the loss does not decrease, and solutions**:
   - **Learning rate too high**: Try reducing the learning rate, for example from 5e-5 to 1e-5
   - **Optimizer issues**: Try different optimizers, such as switching from Adam to AdamW
   - **Initialization issues**: Check the model initialization settings
   - **Data issues**: Check whether the training data contains outliers or label errors
2. **Reasons for NaN values, and solutions**:
   - **Gradient explosion**: Add gradient clipping by setting the `max_grad_norm` parameter
   - **Learning rate too high**: Significantly reduce the learning rate
   - **Numerical instability**: This may occur with mixed precision training; try disabling the `fp16` option
   - **Data anomalies**: Check whether the input data contains extreme values

### Q12: What is overfitting, and how can it be avoided?

**Answer**: Overfitting refers to a model performing well on training data but poorly on new data. Methods to avoid overfitting include:

1. **Increase the amount of data**: Use more training data or data augmentation techniques.
2. **Regularization methods**:
   - Add dropout (usually set to 0.1-0.3)
   - Use weight decay
   - Early stopping: stop training when the validation performance no longer improves
3. **Simplify the model**:
   - Use fewer layers or smaller hidden dimensions
   - Freeze some layers of the pre-trained model (using the freeze method)
4. **Cross-validation**: Use k-fold cross-validation to obtain a more robust model.

## Evaluation Issues

### Q13: How do I interpret evaluation metrics? Which metric is the most important?

**Answer**: Different tasks focus on different metrics:

1. **Classification tasks**:
   - **Accuracy**: The proportion of correct predictions, suitable for balanced datasets
   - **F1 Score**: The harmonic mean of precision and recall, suitable for imbalanced datasets
   - **MCC (Matthews Correlation Coefficient)**: A comprehensive measure of classification performance that is more robust to class imbalance
   - **AUROC (Area Under the ROC Curve)**: Measures the model's ability to distinguish between classes
2. **Regression tasks**:
   - **MSE (Mean Squared Error)**: The average of the squared differences between predicted and actual values; the smaller, the better
   - **RMSE (Root Mean Squared Error)**: The square root of MSE, in the same units as the original data
   - **MAE (Mean Absolute Error)**: The average of the absolute differences between predicted and actual values
   - **R² (Coefficient of Determination)**: Measures the proportion of variance explained by the model; the closer to 1, the better
3. **Most important metric**: This depends on your specific application. For example, in drug screening you may focus more on the true positive rate; for structural prediction, you may focus more on RMSE.
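If you export predictions and want to recompute these metrics yourself, scikit-learn implements all of them. The sketch below is illustrative only: the file `predictions.csv` and its column names (`label`, `prediction`, `probability`) are placeholders for whatever your output file contains, and you would use only the classification or only the regression block, depending on your task.

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             roc_auc_score, mean_squared_error,
                             mean_absolute_error, r2_score)

df = pd.read_csv("predictions.csv")            # placeholder file name
y_true, y_pred = df["label"], df["prediction"]

# Classification metrics (AUROC needs predicted probabilities, not hard labels).
print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("AUROC:", roc_auc_score(y_true, df["probability"]))

# Regression metrics (for regression tasks, where predictions are continuous).
print("MSE:", mean_squared_error(y_true, y_pred))
print("RMSE:", mean_squared_error(y_true, y_pred) ** 0.5)
print("MAE:", mean_absolute_error(y_true, y_pred))
print("R2:", r2_score(y_true, y_pred))
```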
### Q14: What should I do if the evaluation results are poor?

**Answer**: Common strategies to improve model performance:

1. **Data quality**:
   - Check for errors or noise in the data
   - Increase the number of training samples
   - Ensure the training and test set distributions are similar
2. **Model adjustments**:
   - Try different pre-trained models
   - Adjust hyperparameters such as the learning rate and batch size
   - Use different fine-tuning methods (full-parameter fine-tuning, LoRA, etc.)
3. **Feature engineering**:
   - Add structural information (e.g., using foldseek features)
   - Consider sequence characteristics (e.g., hydrophobicity, charge, etc.)
4. **Ensemble methods**:
   - Train multiple models and combine their results
   - Use cross-validation to obtain a more robust model

### Q15: Why does my model perform much worse on the test set than on the validation set?

**Answer**: Common reasons for decreased performance on the test set:

1. **Data distribution shift**:
   - The training, validation, and test set distributions are inconsistent
   - The test set contains protein families or features not seen during training
2. **Overfitting**:
   - The model overfits the validation set because it was used for model selection
   - Increasing regularization or reducing the number of training epochs may help
3. **Data leakage**:
   - Test data information unintentionally leaked into the training process
   - Ensure data splitting is done before preprocessing to avoid cross-contamination
4. **Randomness**:
   - If the test set is small, results may be influenced by chance
   - Try training multiple models with different random seeds and averaging the results

## Prediction Issues

### Q16: How can I speed up the prediction process?

**Answer**: Methods to speed up predictions:

1. **Batch prediction**: Use batch prediction mode instead of single-sequence prediction, which utilizes the GPU more efficiently.
2. **Reduce computation**:
   - Use a smaller model or a more efficient fine-tuning method
   - Reduce the maximum sequence length (if possible)
3. **Hardware optimization**:
   - Use a faster GPU or CPU
   - Ensure predictions run on the GPU rather than the CPU
4. **Model optimization**:
   - Try model quantization (e.g., int8 quantization)
   - Exporting to ONNX format may provide faster inference

### Q17: What could be the reason for the prediction results being significantly different from expectations?

**Answer**: Possible reasons for prediction discrepancies:

1. **Data mismatch**:
   - The sequences being predicted differ from the training data distribution
   - There are significant differences in sequence length, composition, or structural features
2. **Model issues**:
   - The model is under-trained or overfitted
   - An unsuitable pre-trained model was chosen for the task
3. **Parameter configuration**:
   - Ensure the parameters used during prediction (such as the maximum sequence length) are consistent with those used during training
   - Check that the correct problem type (classification/regression) is being used
4. **Data preprocessing**:
   - Ensure the prediction data undergoes the same preprocessing steps as the training data
   - Check that the sequence format is correct (standard amino acid letters, no special characters)

### Q18: How can I batch predict a large number of sequences?

**Answer**: Steps for efficient batch prediction:

1. **Prepare the input file**:
   - Create a CSV file containing all sequences
   - The file must include a `sequence` column
   - Optionally include an ID or other identifier columns
2. **Use the batch prediction feature**:
   - Go to the prediction tab
   - Select "Batch Prediction" mode
   - Upload the sequence file
   - Set an appropriate batch size (usually 16-32 is a good balance)
3. **Optimize settings**:
   - Increasing the batch size can improve throughput (if memory allows)
   - Reducing unnecessary feature calculations can speed up processing
4. **Result handling**:
   - After prediction is complete, the system generates a CSV file containing the original sequences and prediction results
   - You can download this file for further analysis
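If you prefer to script batch prediction instead of using the web interface, a sketch along the following lines works for any model saved in the standard Hugging Face format. It is an assumption-laden example rather than ProFactory's own prediction code: the model directory, file names, batch size, and `max_length` are placeholders, and some protein tokenizers (e.g., ProtBERT) expect residues separated by spaces.

```python
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
model_dir = "path/to/finetuned_model"                # placeholder directory
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir).to(device).eval()

df = pd.read_csv("sequences.csv")                    # must contain a 'sequence' column
batch_size, preds = 32, []

with torch.no_grad():
    for start in range(0, len(df), batch_size):
        batch = df["sequence"].iloc[start:start + batch_size].tolist()
        inputs = tokenizer(batch, padding=True, truncation=True,
                           max_length=512, return_tensors="pt").to(device)
        logits = model(**inputs).logits
        preds.extend(logits.argmax(dim=-1).cpu().tolist())  # predicted class indices

df["prediction"] = preds
df.to_csv("predictions.csv", index=False)
```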
## Model and Result Issues

### Q19: Which pre-trained model should I choose?

**Answer**: Model selection recommendations:

1. **For general tasks**:
   - ESM-2 is suitable for a wide range of protein-related tasks and balances performance and efficiency
   - ProtBERT performs well on certain sequence classification tasks
2. **Considerations**:
   - **Data volume**: When data is limited, a smaller model may be better (to avoid overfitting)
   - **Sequence length**: For long sequences, consider models that support longer contexts
   - **Computational resources**: When resources are limited, choose smaller models or parameter-efficient methods
   - **Task type**: Different models have advantages on different tasks
3. **Recommended strategy**: If conditions allow, try several different models and choose the one that performs best on the validation set.

### Q20: How do I interpret the loss curve during training?

**Answer**: Guidelines for interpreting the loss curve:

1. **Ideal curve**:
   - Both training loss and validation loss decrease steadily
   - The two curves eventually stabilize and converge
   - The validation loss stabilizes near its lowest point
2. **Common patterns and their meanings**:
   - **Training loss continues to decrease while validation loss increases**: A sign of overfitting; consider increasing regularization
   - **Both losses stagnate at high values**: Indicates underfitting; you may need a more complex model or longer training
   - **The curve fluctuates dramatically**: The learning rate may be too high; consider lowering it
   - **Validation loss is lower than training loss**: This may indicate a data splitting issue, or an effect of dropout and batch normalization behaving differently during training and evaluation
3. **Adjusting based on the curve**:
   - If validation loss stops improving early, consider early stopping
   - If training loss decreases very slowly, try increasing the learning rate
   - If there are sudden jumps in the curve, check for data issues or learning rate scheduling problems

### Q21: How do I save and share my model?

**Answer**: Guidelines for saving and sharing models:

1. **Local saving**:
   - After training is complete, the model is automatically saved in the specified output directory
   - The complete model includes the model weights, configuration files, and tokenizer information
2. **Important files**:
   - `pytorch_model.bin`: Model weights
   - `config.json`: Model configuration
   - `special_tokens_map.json` and `tokenizer_config.json`: Tokenizer configuration
3. **Sharing the model**:
   - **Hugging Face Hub**: The easiest way is to upload to Hugging Face
     - Create a model repository
     - Upload your model files
     - Add a model description and usage instructions in the README
   - **Local export**: You can also compress the model folder and share it
     - Ensure all necessary files are included
     - Provide environment requirements and usage instructions
4. **Documentation**: Regardless of the sharing method, you should provide:
   - A description of the training data
   - The model architecture and parameters
   - Performance metrics
   - Usage examples
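If your output directory follows the standard Hugging Face layout described above, uploading to the Hub takes only a few lines. This is a generic sketch, not a ProFactory command; `username/my-protein-model` and the local path are placeholders, and it assumes you have already authenticated with `huggingface-cli login`.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_dir = "output/my_finetuned_model"        # placeholder: local training output
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Uploads weights, config, and tokenizer files to a Hub repository
# (created automatically if it does not exist yet).
model.push_to_hub("username/my-protein-model")       # placeholder repo id
tokenizer.push_to_hub("username/my-protein-model")
```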
## Interface and Operation Issues

### Q22: What should I do if the interface loads slowly or crashes?

**Answer**: Solutions for interface issues:

1. **Browser-related**:
   - Try a different browser (Chrome usually has the best compatibility)
   - Clear the browser cache and cookies
   - Disable unnecessary browser extensions
2. **Resource issues**:
   - Ensure the system has enough memory
   - Close other resource-intensive programs
   - If running on a remote server, check the server load
3. **Network issues**:
   - Ensure the network connection is stable
   - If using the tool through an SSH tunnel, check that the connection is stable
4. **Restart services**:
   - Try restarting the Gradio service
   - In extreme cases, restart the server

### Q23: Why does my training stop responding midway?

**Answer**: Possible reasons and solutions for training that stops responding:

1. **Resource exhaustion**:
   - Insufficient system memory
   - GPU memory overflow
   - Solution: Reduce the batch size, use more efficient training methods, or increase system resources
2. **Process termination**:
   - The system's OOM (Out of Memory) killer terminated the process
   - Server timeout policies may terminate long-running processes
   - Solution: Check the system logs, use tools like screen or tmux to run in the background, and reduce resource usage
3. **Network or interface issues**:
   - Browser crashes or network disconnections
   - Solution: Run training via the command line, or ensure a stable network connection
4. **Data or code issues**:
   - Anomalies or incorrect formats in the dataset cause processing to hang
   - Solution: Check the dataset, and test the pipeline with a small subset of the data first