---
library_name: transformers
tags: [text2sql, sql-generation, t5, natural-language-processing]
---
|
|
|
# Model Card for ThotaBhanu/t5_sql_askdb

## Model Details

### Model Description

This model is a **T5-based natural language to SQL** converter, fine-tuned on the **WikiSQL dataset**. It is designed to convert **English natural language queries** into **SQL queries** that can be executed on relational databases.

- **Developed by:** Bhanu Prasad Thota
- **Shared by:** Bhanu Prasad Thota
- **Model type:** T5-based sequence-to-sequence model
- **Language(s):** English
- **License:** MIT
- **Finetuned from model:** `t5-large`

This model is particularly useful for **text-to-SQL applications**, allowing users to **query databases using plain English** instead of writing SQL.
|
|
|
---

## Model Sources

- **Repository:** [https://huggingface.co/ThotaBhanu/t5_sql_askdb](https://huggingface.co/ThotaBhanu/t5_sql_askdb)
- **Paper [optional]:** N/A
- **Demo [optional]:** Coming soon

---
|
|
|
## Uses

### Direct Use

- Convert **natural language questions** into **SQL queries**
- Assist in **database query automation**
- Can be used in **chatbots, data analytics tools, and enterprise database search systems**

### Downstream Use

- Can be **fine-tuned** further on **custom datasets** to improve domain-specific SQL generation
- Can be integrated into **business intelligence tools** for better user interaction

### Out-of-Scope Use

- The model does **not infer database schema** automatically
- May generate incorrect SQL for **complex nested queries or multi-table joins**
- Not suitable for **non-relational (NoSQL) databases**

---
|
|
|
## Bias, Risks, and Limitations

- The model may not **always generate valid SQL** for **custom database schemas**
- Assumes **consistent column naming**, which may not always be the case in enterprise databases
- Performance depends on **how well the input query aligns** with the training data format

### Recommendations

- Always **validate generated SQL** before executing it on a live database (one possible check is sketched below)
- Use **schema-aware** validation methods for production environments
- Consider **fine-tuning the model** on domain-specific SQL queries
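How you validate will depend on your database engine. As a minimal, hedged sketch (not part of this model or its repository), the helper below asks SQLite to compile a generated query with `EXPLAIN`, which catches syntax errors and references to tables or columns that do not exist in the schema, without actually running the query:

```python
import sqlite3

def compiles_against_schema(sql: str, db_path: str) -> bool:
    """Return True if `sql` parses and references only tables/columns that
    exist in the SQLite database at `db_path`. EXPLAIN compiles the statement
    without executing it."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(f"EXPLAIN {sql}")
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

# Hypothetical usage with a generated query and a local database file:
# compiles_against_schema("SELECT name FROM employees WHERE join_year = 2020", "company.db")
```

For other engines, an equivalent dry-run check (for example, preparing the statement or using that engine's EXPLAIN) plays the same role.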
|
|
|
---
|
|
|
## How to Get Started with the Model

Use the code below to generate SQL queries from natural language:
|
|
|
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load model and tokenizer
model_name = "ThotaBhanu/t5_sql_askdb"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Function to convert a natural language query to SQL
def generate_sql(query):
    input_text = f"Convert to SQL: {query}"
    inputs = tokenizer(input_text, return_tensors="pt")
    # Allow room for a full SQL statement rather than relying on the default generation length
    output = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example usage
query = "Find all employees who joined in 2020"
sql_query = generate_sql(query)

print(f"📝 Query: {query}")
print(f"🛠 Generated SQL: {sql_query}")
```
|
|
|
|
|
## Training Details

### Training Data

- **Dataset:** WikiSQL
- **Size:** 80,654 pairs of natural language questions and SQL queries
- **Preprocessing:** Tokenization using `T5Tokenizer`, max length 128
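The preprocessing script itself is not published with this card. The sketch below shows one plausible way to tokenize question/SQL pairs to a maximum length of 128, assuming the same "Convert to SQL:" prefix used in the inference example above (the function name and padding strategy are illustrative):

```python
# Hypothetical preprocessing sketch; `questions` and `sql_targets` stand in
# for WikiSQL natural language questions and their target SQL strings.
def preprocess(questions, sql_targets, tokenizer, max_length=128):
    model_inputs = tokenizer(
        [f"Convert to SQL: {q}" for q in questions],  # same prefix as inference
        max_length=max_length, truncation=True, padding="max_length",
    )
    labels = tokenizer(
        text_target=sql_targets,
        max_length=max_length, truncation=True, padding="max_length",
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```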
|
|
|
|
|
### Training Procedure

- **Training framework:** Hugging Face Transformers + PyTorch
- **Hardware used:** NVIDIA V100 GPU
- **Optimizer:** AdamW
- **Learning rate:** 5e-5
- **Batch size:** 8
- **Epochs:** 5
|
|
|
#### Training Hyperparameters

- **Training precision:** Mixed precision (fp16)
- **Gradient accumulation:** Yes (to reach a larger effective batch size than fits in GPU memory)
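The training script is not published; a minimal `Seq2SeqTrainingArguments` setup consistent with the hyperparameters above might look like the sketch below. The output directory and the exact gradient accumulation factor are assumptions, since only the use of gradient accumulation is stated:

```python
from transformers import Seq2SeqTrainingArguments

# Values mirror the card where stated; AdamW is the Trainer's default optimizer.
training_args = Seq2SeqTrainingArguments(
    output_dir="t5_sql_askdb",          # assumption: not documented
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,      # assumption: exact factor not documented
    num_train_epochs=5,
    fp16=True,                          # mixed precision, as stated above
)

# A Seq2SeqTrainer would then be constructed with the model, these arguments,
# and the tokenized WikiSQL splits, and trained with trainer.train().
```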
|
|
|
#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

[More Information Needed]
|
|
|
## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

[More Information Needed]

#### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

[More Information Needed]

### Results

[More Information Needed]
|
|
|
#### Summary

[More Information Needed]
|
|
|
|
|
|
|
## Model Examination [optional]

<!-- Relevant interpretability work for the model goes here -->

[More Information Needed]
|
|
|
## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** NVIDIA V100 GPU (see Training Procedure)
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]
|
|
|
## Technical Specifications [optional]

### Model Architecture and Objective

T5 (`t5-large`) encoder-decoder model fine-tuned with a sequence-to-sequence objective to map English natural language questions to SQL queries.

### Compute Infrastructure

#### Hardware

NVIDIA V100 GPU (see Training Procedure)

#### Software

Hugging Face Transformers + PyTorch
|
|
|
## Citation [optional]

**BibTeX:**

```bibtex
@misc{t5_sql_askdb,
  author       = {Bhanu Prasad Thota},
  title        = {T5-SQL AskDB Model},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ThotaBhanu/t5_sql_askdb}}
}
```

**APA:**

Thota, B. P. (2025). *T5-SQL AskDB Model*. Hugging Face. https://huggingface.co/ThotaBhanu/t5_sql_askdb
|
|
|
## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

Bhanu Prasad Thota

## Model Card Contact

[More Information Needed]