Spaces:

juancamval
/

graph_generator

Sleeping

App Files Files Community

Integrate-Gemma

by angelicaporto - opened Apr 27

base: refs/heads/main

←

from: refs/pr/2

Discussion Files changed

+172

-911

Files changed (7) hide show

README.md +7 -120
app.py +150 -181
assets/supabase_db.png +0 -0
documentacion.txt +1 -0
gemma3test.py +0 -272
gpt2test.py +0 -332
requirements.txt +14 -6

README.md CHANGED Viewed

@@ -1,126 +1,13 @@
 ---
-title: GrapGen
-emoji: 🙂
-colorFrom: blue
-colorTo: yellow
 sdk: streamlit
-sdk_version: 1.45.0
 app_file: app.py
 pinned: false
 ---
-# 🧠 Graph Generator
-Espacio en Hugging Face para generar gráficos con ayuda de modelos LLM y una base de datos en Supabase.
-## Clonar el repositorio
-```
-git clone https://huggingface.co/spaces/juancamval/graph_generator
-cd graph_generator
-````
-## Crear y activar entorno virtual
-#### En Windows:
-```
-python -m venv .venv
-.\.venv\Scripts\activate
-```
-#### En Linux/Mac:
-```
-python3 -m venv .venv
-source .venv/bin/activate
-```
-## Instalar dependencias
-```
-pip install -r requirements.txt
-```
-#### requirements.txt
-```txt
-pandas
-numpy
-streamlit
-python-dotenv
-matplotlib
-pandasai
-```
-## Variables de entorno `.env`
-Crea un archivo `.env` en la raíz del proyecto con esta estructura:
-```env
-SUPABASE_URL=
-SUPABASE_KEY=
-```
-> Puedes encontrar estas variables en **Supabase > Project Settings > API Settings**:
->
-> * **URL** → `SUPABASE_URL`
-> * **anon public** → `SUPABASE_KEY`
-### Estructura esperada de la base de datos
-Asegúrate de que la estructura de la base de datos en Supabase coincida con la siguiente imagen:
-![estructura\_db](./assets/supabase_db.png)
----
-## Modelo LLM con Ollama
-#### 1. Instalar [Ollama](https://ollama.com/)
-Descarga e instala Ollama desde su sitio oficial.
-#### 2. Abrir Ollama
-Inicia Ollama para que esté activo en segundo plano.
-#### 3. Descargar el modelo `gemma3:12b`
-```
-ollama pull gemma3:12b
-```
----
-## Ejecutar la app
-Con el entorno virtual activado, y Ollama corriendo:
-```
-streamlit run app.py
-```
----
-## ✅ Comandos resumen
-```
-git clone https://huggingface.co/spaces/juancamval/graph_generator
-cd graph_generator
-# Crear entorno virtual
-python -m venv .venv
-# Windows
-.\.venv\Scripts\activate
-# Linux/Mac
-source .venv/bin/activate
-# Instalar dependencias
-pip install -r requirements.txt
-# Descargar modelo con Ollama
-ollama pull gemma3:12b
-# Ejecutar la app
-streamlit run app.py
-```

 ---
+title: Graph Generator
+emoji: 📚
+colorFrom: purple
+colorTo: red
 sdk: streamlit
+sdk_version: 1.44.1
 app_file: app.py
 pinned: false
+short_description: LLM para la creación de gráficos basados en un dataset local
 ---
+Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

app.py CHANGED Viewed

@@ -1,181 +1,150 @@
-# ---------------------------------------------------------------------------------
-# Aplicación principal para cargar el modelo, generar prompts y explicar los datos
-# ---------------------------------------------------------------------------------
-import streamlit as st  # type: ignore
-import os
-import re
-import pandas as pd  # type: ignore
-from dotenv import load_dotenv  # type: ignore # Para cambios locales
-from supabase import create_client, Client  # type: ignore
-# from pandasai import SmartDataframe  # type: ignore
-from pandasai import SmartDatalake  # type: ignore # Porque ya usamos más de un df (más de una tabla de nuestra db)
-from pandasai.llm.local_llm import LocalLLM  # type: ignore
-from pandasai import Agent
-import matplotlib.pyplot as plt
-import time
-# ---------------------------------------------------------------------------------
-# Funciones auxiliares
-# ---------------------------------------------------------------------------------
-def generate_graph_prompt(user_query):
-    prompt = f"""
-            You are a senior data scientist analyzing European labor force data.
-            Given the user's request: "{user_query}"
-            1. Plot the relevant data using matplotlib:
-            - Use pandas indexing with boolean conditions, not .query().
-            - For example: df[(df['geo'] == 'Germany') & (df['year'] >= 2018)]
-            - Include clear axis labels and a descriptive title.
-            - Save the plot as an image file (e.g., temp_chart.png).
-            2. After plotting, write a *concise analytical summary* of the trend based on those years. The summary should:
-            - Use .diff() followed by .idxmax() and .idxmin() to find where the largest change occurs.
-            - Use .loc[] to retrieve the corresponding year and value.
-            - Calculate percent changes safely (check for divide-by-zero).
-            - Avoid using .index() on float values.
-            3. Store the summary in a variable named explanation.
-            4. Return a result dictionary structured as follows:
-            result = {{
-                "type": "plot",
-                "value": "temp_chart.png",
-                "explanation": explanation
-            }}
-            IMPORTANT: Use only the data available in the input DataFrame.
-            """
-    return prompt
-# ---------------------------------------------------------------------------------
-# Configuración de conexión a Supabase
-# ---------------------------------------------------------------------------------
-# Cargar variables de entorno desde archivo .env
-load_dotenv()
-# Conectar las credenciales de Supabase (ubicadas en "Secrets" en Streamlit)
-SUPABASE_URL = os.getenv("SUPABASE_URL")
-SUPABASE_KEY = os.getenv("SUPABASE_KEY")
-# Crear cliente Supabase
-supabase: Client = create_client(SUPABASE_URL, SUPABASE_KEY)
-# Función para cargar datos de una tabla de Supabase
-# Tablas posibles: fertility, geo data, labor, population, predictions
-def load_data(table):
-    try:
-        if supabase:
-            response = supabase.from_(table).select("*").execute()
-            print(f"Response object: {response}")  # Inspeccionar objeto completo
-            print(f"Response type: {type(response)}")  # Verificar tipo de objeto
-            # Acceder a atributos relacionados a error o data
-            if hasattr(response, 'data'):
-                print(f"Response data: {response.data}")
-                return pd.DataFrame(response.data)
-            elif hasattr(response, 'status_code'):
-                print(f"Response status code: {response.status_code}")
-            elif hasattr(response, '_error'):  # Versiones antiguas
-                print(f"Older error attribute: {response._error}")
-                st.error(f"Error fetching data: {response._error}")
-                return pd.DataFrame()
-            else:
-                st.info("Response object does not have 'data' or known error attributes. Check the logs.")
-                return pd.DataFrame()
-        else:
-            st.error("Supabase client not initialized. Check environment variables.")
-            return pd.DataFrame()
-    except Exception as e:
-        st.error(f"An error occurred during data loading: {e}")
-        return pd.DataFrame()
-# ---------------------------------------------------------------------------------
-# Cargar datos iniciales
-# ---------------------------------------------------------------------------------
-labor_data = load_data("labor")
-fertility_data = load_data("fertility")
-# population_data = load_data("population")
-# predictions_data = load_data("predictions")
-# ---------------------------------------------------------------------------------
-# Inicializar LLM desde Ollama con PandasAI
-# ---------------------------------------------------------------------------------
-ollama_llm = LocalLLM(api_base="http://localhost:11434/v1",
-                      model="gemma3:12b",
-                      temperature=0.1,
-                      max_tokens=8000)
-# lm_studio_llm = LocalLLM(api_base="http://localhost:1234/v1")  # el modelo es gemma-3-12b-it-qat
-# sdl = SmartDatalake([labor_data, fertility_data, population_data, predictions_data], config={"llm": ollama_llm}) # DataFrame PandasAI-ready.
-# sdl = SmartDatalake([labor_data, fertility_data], config={"llm": ollama_llm})
-# agent = Agent([labor_data], config={"llm": lm_studio_llm})
-agent = Agent(
-    [
-        labor_data,
-        fertility_data
-    ],
-    config={
-        "llm": ollama_llm,
-        "enable_cache": False,
-        "enable_filter_extraction": False  # evita errores de parseo
-    }
-)
-# ---------------------------------------------------------------------------------
-# Configuración de la app en Streamlit
-# ---------------------------------------------------------------------------------
-# Título de la app
-st.title("Europe GraphGen  :blue[Graph generator] :flag-eu:")
-# Entrada de usuario para describir el gráfico
-user_input = st.chat_input("What graphics do you have in mind")
-if user_input:
-    with st.spinner('Generating answer...'):
-        try:
-            print(f"\nGenerating prompt...\n")
-            prompt = generate_graph_prompt(user_input)
-            print(f"\nPrompt generated: {prompt}\n")
-            start_time = time.time()
-            answer = agent.chat(prompt)
-            print(f"\nAnswer type: {type(answer)}\n")  # Verificar tipo de objeto
-            print(f"\nAnswer content: {answer}\n")  # Inspeccionar contenido de la respuesta
-            print(f"\nFull result: {agent.last_result}\n")
-            full_result = agent.last_result
-            explanation = full_result.get("explanation", "")
-            elapsed_time = time.time() - start_time
-            print(f"\nExecution time: {elapsed_time:.2f} seconds\n")
-            if isinstance(answer, str) and os.path.isfile(answer):
-                # Si el output es una ruta válida a imagen
-                im = plt.imread(answer)
-                st.image(im)
-                os.remove(answer)  # Limpiar archivo temporal
-                if explanation:
-                    st.markdown(f"*Explanation:* {explanation}")
-            else:
-                # Si no es una ruta válida, mostrar como texto
-                st.markdown(str(answer))
-        except Exception as e:
-            st.error(f"Error generating answer: {e}")

+# ---------------------------------------------------------------------------------
+# Aplicación principal para cargar el modelo, generar prompts y explicar los datos
+# ---------------------------------------------------------------------------------
+import streamlit as st  # type: ignore
+import os
+import re
+import pandas as pd  # type: ignore
+from dotenv import load_dotenv  # type: ignore # Para cambios locales
+from supabase import create_client, Client  # type: ignore
+from transformers import pipeline
+from pandasai import SmartDataframe  # type: ignore
+from pandasai.llm.starcoder import Starcoder # type: ignore
+# ---------------------------------------------------------------------------------
+# Funciones auxiliares
+# ---------------------------------------------------------------------------------
+# Función para extracción de código Python del output del modelo
+def extract_code(llm_output):
+    code_match = re.search(r"```python\n(.*?)\n```", llm_output, re.DOTALL)
+    if code_match:
+        return code_match.group(1)
+    return None
+# Función para generar prompts de gráficos comparativos
+# Ejemplo de prompt generado:
+# generate_graph_prompt("Germany", "France", "fertility rate", 2020, 2030)
+def generate_graph_prompt(country1, country2, metric, start_year, end_year):
+    prompt = f"""
+    You have access to a database of European countries with data on {metric}, labor force participation, population, and their predictions for future years.
+    Generate Python code using matplotlib to create a line graph showing the trend of {metric} for {country1} and {country2} from {start_year} to {end_year}.
+    Also, provide a concise explanation of what this graph represents for an end user who might not be familiar with the data.
+    """
+    return prompt
+# ---------------------------------------------------------------------------------
+# Configuración de conexión a Supabase
+# ---------------------------------------------------------------------------------
+# Cargar variables de entorno desde archivo .env
+load_dotenv()
+# Conectar las credenciales de Supabase (ubicadas en "Secrets" en Streamlit)
+SUPABASE_URL = os.getenv("SUPABASE_URL")
+SUPABASE_KEY = os.getenv("SUPABASE_KEY")
+# Crear cliente Supabase
+supabase: Client = create_client(SUPABASE_URL, SUPABASE_KEY)
+# Función para cargar datos de una tabla de Supabase
+# Tablas posibles: fertility, geo data, labor, population, predictions
+def load_data(table):
+    try:
+        if supabase:
+            response = supabase.from_(table).select("*").execute()
+            print(f"Response object: {response}")  # Inspeccionar objeto completo
+            print(f"Response type: {type(response)}")  # Verificar tipo de objeto
+            # Acceder a atributos relacionados a error o data
+            if hasattr(response, 'data'):
+                print(f"Response data: {response.data}")
+                return pd.DataFrame(response.data)
+            elif hasattr(response, 'status_code'):
+                print(f"Response status code: {response.status_code}")
+            elif hasattr(response, '_error'):  # Versiones antiguas
+                print(f"Older error attribute: {response._error}")
+                st.error(f"Error fetching data: {response._error}")
+                return pd.DataFrame()
+            else:
+                st.info("Response object does not have 'data' or known error attributes. Check the logs.")
+                return pd.DataFrame()
+        else:
+            st.error("Supabase client not initialized. Check environment variables.")
+            return pd.DataFrame()
+    except Exception as e:
+        st.error(f"An error occurred during data loading: {e}")
+        return pd.DataFrame()
+# ---------------------------------------------------------------------------------
+# Cargar datos iniciales
+# ---------------------------------------------------------------------------------
+# # Cargar datos desde la tabla "labor"
+data = load_data("labor")
+# TODO: La idea es luego usar todas las tablas, cuando ya funcione.
+# Se puede si el modelo funciona con las gráficas, sino que toca mejorarlo
+# porque serían consultas más complejas.
+# labor_data = load_data("labor")
+# fertility_data = load_data("fertility")
+# population_data = load_data("population")
+# predictions_data = load_data("predictions")
+# ---------------------------------------------------------------------------------
+# Inicializar modelo LLM
+# ---------------------------------------------------------------------------------
+# # Pendiente cambiar Keys dependiendo del modelo que escojamos
+# model_name = "google/flan-t5-small"  # Probando modelos
+# generator = pipeline("text-generation", model=model_name)
+# ---------------------------------------------------------------------------------
+# Inicializar PandasAI con StarCoder
+# ---------------------------------------------------------------------------------
+# Definir el modelo StarCoder desde Hugging Face
+huggingface_token = os.getenv("HUGGINGFACE_TOKEN")
+llm = Starcoder(api_token=huggingface_token)
+sdf = SmartDataframe(data, config={"llm": llm}) # DataFrame PandasAI-ready.
+# ---------------------------------------------------------------------------------
+# Configuración de la app en Streamlit
+# ---------------------------------------------------------------------------------
+# Título de la app
+st.title("_Europe GraphGen_  :blue[Graph generator] :flag-eu:")
+# Entrada de usuario para describir el gráfico
+user_input = st.text_input("What graphics do you have in mind")
+generate_button = st.button("Generate")
+# Manejo de evento de botón
+if generate_button and user_input:
+    # if data.empty and supabase is not None:
+    #     st.warning("Successfully connected to Supabase, but no data was loaded (either the table is empty or there was a query issue). Check the error message above if any.")
+    # elif not data.empty:
+    #     st.success("Successfully connected to Supabase and loaded data!")
+    #     st.dataframe(data.head())  # Mostrar una pequeña muestra del DataFrame
+    # elif supabase is None:
+    #     st.error("Failed to initialize Supabase client. Check environment variables in Settings.")
+    # else:
+    #     st.info("Attempted to load data. Check for any error messages above.")
+    # Procesar el input del usuario con PandasAI
+    if generate_button and user_input:
+        st.dataframe(data.head())
+        with st.spinner('Generating answer...'):
+            try:
+                answer = sdf.chat(user_input)
+                st.write(answer)
+            except Exception as e:
+                st.error(f"Error generating answer: {e}")
+# TODO: Output estructurado si vemos que es necesario.

assets/supabase_db.png DELETED Viewed

Binary file (74.3 kB)

documentacion.txt ADDED Viewed

	@@ -0,0 +1 @@


1	+ 1. La conexion entre Supabase y Hugginface se logro hacer, al realizarla por seguridad en Supabase (por defecto RLS "row-level security" estaba activado), retornaba una lista vacia, desabilitando esta opcion desde el proyecto de Supabase donde estaba el problema se logro solucionar y tener acceso

gemma3test.py DELETED Viewed

@@ -1,272 +0,0 @@
-import streamlit as st
-import os
-import time
-import pandas as pd
-from dotenv import load_dotenv
-from supabase import create_client, Client
-from transformers import pipeline
-from sentence_transformers import SentenceTransformer
-import plotly.graph_objects as go
-import numpy as np
-from sklearn.metrics.pairwise import cosine_similarity
-import re
-# ---------------------------------------------------------------------------------
-# Funciones auxiliares
-# ---------------------------------------------------------------------------------
-def extract_country_and_dates(prompt, countries):
-    country = None
-    start_date = None
-    end_date = None
-    # Buscar el país (insensible a mayúsculas y minúsculas)
-    for c in countries:
-        if re.search(r'\b' + re.escape(c) + r'\b', prompt, re.IGNORECASE):
-            country = c
-            break
-    # Buscar rangos de años con diferentes separadores (-, to, until, from ... to, between ... and)
-    date_ranges = re.findall(r'(\d{4})\s*(?:-|to|until|from.*?to|between.*?and)\s*(\d{4})', prompt, re.IGNORECASE)
-    if date_ranges:
-        start_date = date_ranges[0][0]
-        end_date = date_ranges[0][1]
-    else:
-        # Buscar un solo año
-        single_years = re.findall(r'\b(\d{4})\b', prompt)
-        if single_years:
-            start_date = single_years[0]
-            end_date = single_years[0]
-    return country, start_date, end_date
-def generate_plotly_graph(df, user_query, country=None, start_date=None, end_date=None):
-    relevant_data = df.copy()
-    if 'geo' in relevant_data.columns and country:
-        relevant_data = relevant_data[relevant_data['geo'].str.lower() == country.lower()]
-    if 'year' in relevant_data.columns:
-        relevant_data['year'] = pd.to_numeric(relevant_data['year'], errors='coerce').dropna().astype(int)
-        if start_date and end_date:
-            relevant_data = relevant_data[
-                (relevant_data['year'] >= int(start_date)) & (relevant_data['year'] <= int(end_date))
-                ]
-        elif start_date:
-            relevant_data = relevant_data[relevant_data['year'] >= int(start_date)]
-        elif end_date:
-            relevant_data = relevant_data[relevant_data['year'] <= int(end_date)]
-    numeric_cols = relevant_data.select_dtypes(include=['number']).columns.tolist()
-    if 'year' in relevant_data.columns and numeric_cols:
-        fig = go.Figure()
-        for col in numeric_cols:
-            if col != 'year':
-                fig.add_trace(go.Scatter(x=relevant_data['year'], y=relevant_data[col], mode='lines+markers', name=col))
-        title = f"Data for {country if country else 'All Regions'}"
-        if start_date and end_date:
-            title += f" ({start_date}-{end_date})"
-        elif start_date:
-            title += f" (from {start_date})"
-        elif end_date:
-            title += f" (up to {end_date})"
-        # Añadir título y etiquetas de los ejes
-        fig.update_layout(
-            title=title,
-            xaxis_title="Year",
-            yaxis_title="Value"  # Necesitaremos inferir o tener nombres de columnas más descriptivos
-        )
-        return fig
-    else:
-        return None
-# ---------------------------------------------------------------------------------
-# Configuración de conexión a Supabase
-# ---------------------------------------------------------------------------------
-load_dotenv()
-SUPABASE_URL = os.getenv("SUPABASE_URL")
-SUPABASE_KEY = os.getenv("SUPABASE_KEY")
-supabase: Client = create_client(SUPABASE_URL, SUPABASE_KEY)
-# Función para cargar datos de una tabla de Supabase
-def load_data(table):
-    try:
-        if supabase:
-            response = supabase.from_(table).select("*").execute()
-            if hasattr(response, 'data'):
-                return pd.DataFrame(response.data)
-            elif hasattr(response, '_error'):
-                st.error(f"Error fetching data: {response._error}")
-                return pd.DataFrame()
-            else:
-                st.info("Response object does not have 'data' or known error attributes. Check the logs.")
-                return pd.DataFrame()
-        else:
-            st.error("Supabase client not initialized. Check environment variables.")
-            return pd.DataFrame()
-    except Exception as e:
-        st.error(f"An error occurred during data loading: {e}")
-        return pd.DataFrame()
-# ---------------------------------------------------------------------------------
-# Cargar datos iniciales
-# ---------------------------------------------------------------------------------
-labor_data = load_data("labor")
-fertility_data = load_data("fertility")
-# ---------------------------------------------------------------------------------
-# Inicialización de modelos para RAG
-# ---------------------------------------------------------------------------------
-embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
-llm_pipeline = pipeline("text-generation", model="google/gemma-3-1b-it", token=os.getenv("HF_TOKEN"))
-# ---------------------------------------------------------------------------------
-# Generación de Embeddings y Metadatos (en memoria)
-# ---------------------------------------------------------------------------------
-embeddings_list = []
-contents_list = []
-metadatas_list = []
-ids_list = []
-for index, row in labor_data.iterrows():
-    doc = f"Country: {row['geo']}, Year: {row['year']}, Employment Rate: {row['labour_force'] if 'labour_force' in row else 'N/A'}"
-    embeddings_list.append(embedding_model.encode(doc))
-    contents_list.append(doc)
-    metadatas_list.append({'country': row['geo'], 'year': str(row['year']), 'source': 'labor'})
-    ids_list.append(f"labor_{index}")
-for index, row in fertility_data.iterrows():
-    doc = f"Country: {row['geo']}, Year: {row['year']}, Fertility Rate: {row['fertility_rate'] if 'fertility_rate' in row else 'N/A'}"
-    embeddings_list.append(embedding_model.encode(doc))
-    contents_list.append(doc)
-    metadatas_list.append({'country': row['geo'], 'year': str(row['year']), 'source': 'fertility'})
-    ids_list.append(f"fertility_{index}")
-embeddings_array = np.array(embeddings_list)
-# ---------------------------------------------------------------------------------
-# Función para recuperar documentos relevantes (en memoria)
-# ---------------------------------------------------------------------------------
-def retrieve_relevant_documents_in_memory(query_embedding, stored_embeddings, contents, top_k=3):
-    similarities = cosine_similarity([query_embedding], stored_embeddings)[0]
-    sorted_indices = np.argsort(similarities)[::-1]
-    relevant_documents = [contents[i] for i in sorted_indices[:top_k]]
-    return relevant_documents
-# ---------------------------------------------------------------------------------
-# Generación de la explicación usando RAG
-# ---------------------------------------------------------------------------------
-def generate_rag_explanation(user_query, stored_embeddings, contents):
-    query_embedding = embedding_model.encode(user_query)
-    relevant_docs = retrieve_relevant_documents_in_memory(query_embedding, stored_embeddings, contents)
-    if relevant_docs:
-        context = "\n".join(relevant_docs)
-        augmented_prompt = f"Based on the following information:\n\n{context}\n\nAnswer the question related to: {user_query}"
-        output = llm_pipeline(augmented_prompt, max_length=250, num_return_sequences=1)
-        return output[0]['generated_text']
-    else:
-        return "No relevant information found to answer your query."
-# ---------------------------------------------------------------------------------
-# Generar la lista de países automáticamente
-# ---------------------------------------------------------------------------------
-available_countries_labor = labor_data['geo'].unique().tolist() if 'geo' in labor_data.columns else []
-available_countries_fertility = fertility_data['geo'].unique().tolist() if 'geo' in fertility_data.columns else []
-all_countries = list(set(available_countries_labor + available_countries_fertility))
-# ---------------------------------------------------------------------------------
-# Configuración de la app en Streamlit
-# ---------------------------------------------------------------------------------
-st.set_page_config(page_title="GraphGen", page_icon="🇪🇺")
-st.title("_Europe GraphGen_  :blue[Graph generator] :flag-eu:")
-st.caption("Mapping Europe's data with insights")
-if "messages" not in st.session_state:
-    st.session_state.messages = []
-    st.session_state.messages.append({"role": "assistant", "content": "What graphic and insights do you need?"})
-for message in st.session_state.messages:
-    with st.chat_message(message["role"]):
-        st.markdown(message["content"])
-prompt = st.chat_input("Type your message here...", key="chat_input_bottom")
-if prompt:
-    st.session_state.messages.append({"role": "user", "content": prompt})
-    with st.chat_message("user"):
-        st.markdown(prompt)
-    with st.spinner('Generating answer...'):
-        try:
-            # Determinar el año más reciente en los datos
-            latest_year_labor = labor_data['year'].max() if 'year' in labor_data else datetime.now().year
-            latest_year_fertility = fertility_data['year'].max() if 'year' in fertility_data else datetime.now().year
-            latest_year = max(latest_year_labor, latest_year_fertility, datetime.now().year)
-            country, start_date, end_date = extract_country_and_dates(prompt, all_countries, latest_year)
-            graph_displayed = False
-            # Analizar el prompt para determinar la intención del usuario
-            if re.search(r'\b(labor|employment|job|workforce)\b', prompt, re.IGNORECASE):
-                # Generar gráfica de datos laborales
-                labor_fig = generate_plotly_graph(labor_data, prompt, country, start_date, end_date)
-                if labor_fig:
-                    st.session_state.messages.append(
-                        {"role": "assistant", "content": "Here is the labor data graphic:"})
-                    with st.chat_message("assistant"):
-                        st.plotly_chart(labor_fig)
-                    graph_displayed = True
-            elif re.search(r'\b(fertility|birth|population growth)\b', prompt, re.IGNORECASE):
-                # Generar gráfica de datos de fertilidad
-                fertility_fig = generate_plotly_graph(fertility_data, prompt, country, start_date, end_date)
-                if fertility_fig:
-                    st.session_state.messages.append(
-                        {"role": "assistant", "content": "Here is the fertility data graphic:"})
-                    with st.chat_message("assistant"):
-                        st.plotly_chart(fertility_fig)
-                    graph_displayed = True
-            else:
-                # Si no se identifica una intención clara, intentar mostrar la gráfica de datos laborales primero
-                labor_fig = generate_plotly_graph(labor_data, prompt, country, start_date, end_date)
-                if labor_fig:
-                    st.session_state.messages.append(
-                        {"role": "assistant", "content": "Here is the labor data graphic:"})
-                    with st.chat_message("assistant"):
-                        st.plotly_chart(labor_fig)
-                    graph_displayed = True
-                elif not graph_displayed:
-                    fertility_fig = generate_plotly_graph(fertility_data, prompt, country, start_date, end_date)
-                    if fertility_fig:
-                        st.session_state.messages.append(
-                            {"role": "assistant", "content": "Here is the fertility data graphic:"})
-                        with st.chat_message("assistant"):
-                            st.plotly_chart(fertility_fig)
-                        graph_displayed = True
-            # Generar explicación usando RAG
-            explanation = generate_rag_explanation(prompt, embeddings_array, contents_list)
-            st.session_state.messages.append({"role": "assistant", "content": f"Explanation: {explanation}"})
-            with st.chat_message("assistant"):
-                st.markdown(f"**Explanation:** {explanation}")
-        except Exception as e:
-            st.session_state.messages.append({"role": "assistant", "content": f"Error generating answer: {e}"})
-            with st.chat_message("assistant"):
-                st.error(f"Error generating answer: {e}")
-if st.button("Clear chat"):
-    st.session_state.messages = []
-    st.session_state.messages.append(
-        {"role": "assistant", "content": "Chat has been cleared. What graphic and insights do you need now?"})
-    st.rerun()

gpt2test.py DELETED Viewed

@@ -1,332 +0,0 @@
-import streamlit as st
-import os
-import re
-import pandas as pd
-from dotenv import load_dotenv
-from supabase import create_client, Client
-from transformers import pipeline
-import plotly.express as px
-import plotly.graph_objects as go
-import time
-# ---------------------------------------------------------------------------------
-# Supabase Setup
-# ---------------------------------------------------------------------------------
-load_dotenv()
-SUPABASE_URL = os.getenv("SUPABASE_URL")
-SUPABASE_KEY = os.getenv("SUPABASE_KEY")
-supabase: Client = create_client(SUPABASE_URL, SUPABASE_KEY)
-# ---------------------------------------------------------------------------------
-# Data Loading Function
-# ---------------------------------------------------------------------------------
-def load_data(table):
-    try:
-        if supabase:
-            response = supabase.from_(table).select("*").execute()
-            if hasattr(response, 'data'):
-                return pd.DataFrame(response.data)
-            else:
-                st.error(f"Error fetching data or no data returned for table '{table}'. Check Supabase logs.")
-                return pd.DataFrame()
-        else:
-            st.error("Supabase client not initialized.")
-            return pd.DataFrame()
-    except Exception as e:
-        st.error(f"An error occurred during data loading from table '{table}': {e}")
-        return pd.DataFrame()
-# ---------------------------------------------------------------------------------
-# Helper Function Definitions
-# ---------------------------------------------------------------------------------
-def extract_country_from_prompt_regex(question, country_list):
-    """Extracts the first matching country from the list found in the question."""
-    for country in country_list:
-        # Use word boundaries (\b) for more accurate matching
-        if re.search(r"\b" + re.escape(country) + r"\b", question, re.IGNORECASE):
-            return country
-    return None  # Return None if no country in the list is found
-def extract_years_from_prompt(question):
-    """Extracts a single year or a start/end year range from a question string."""
-    start_year, end_year = None, None
-    # Pattern 1: Single year (e.g., "in 2010", "year 2010")
-    single_year_match = re.search(r'\b(in|year|del)\s+(\d{4})\b', question, re.IGNORECASE)
-    if single_year_match:
-        year = int(single_year_match.group(2))
-        return year, year  # Return single year as start and end
-    # Pattern 2: Year range (e.g., "between 2000 and 2010", "from 2005 to 2015")
-    range_match = re.search(r'\b(between|from)\s+(\d{4})\s+(and|to)\s+(\d{4})\b', question, re.IGNORECASE)
-    if range_match:
-        s_year = int(range_match.group(2))
-        e_year = int(range_match.group(4))
-        return min(s_year, e_year), max(s_year, e_year)  # Ensure start <= end
-    # Pattern 3: Simple range like "2000-2010"
-    simple_range_match = re.search(r'\b(\d{4})-(\d{4})\b', question)
-    if simple_range_match:
-        s_year = int(simple_range_match.group(1))
-        e_year = int(simple_range_match.group(2))
-        return min(s_year, e_year), max(s_year, e_year)
-    # Pattern 4: After Year (e.g., "after 2015")
-    after_match = re.search(r'\b(after|since)\s+(\d{4})\b', question, re.IGNORECASE)
-    if after_match:
-        start_year = int(after_match.group(2))
-        # end_year remains None, signifying >= start_year
-    # Pattern 5: Before Year (e.g., "before 2005")
-    before_match = re.search(r'\b(before)\s+(\d{4})\b', question, re.IGNORECASE)
-    if before_match:
-        end_year = int(before_match.group(2))
-        # start_year remains None, signifying <= end_year
-        # Special case: if 'after' wasn't also found, return (None, end_year)
-        if start_year is None:
-            return None, end_year
-    # Return extracted years (could be None, None; start, None; None, end; or start, end)
-    # If single year patterns were matched first, they returned already.
-    return start_year, end_year
-def filter_df_by_years(df, year_col, start_year, end_year):
-    """Filters a DataFrame based on a year column and a start/end year range."""
-    if year_col not in df.columns:
-        st.warning(f"Year column '{year_col}' not found.")
-        return df
-    try:
-        # Ensure year column is numeric, coerce errors to NaT/NaN
-        df[year_col] = pd.to_numeric(df[year_col], errors='coerce')
-        # Drop rows where conversion failed, essential for comparison
-        df_filtered = df.dropna(subset=[year_col]).copy()
-        # Convert to integer only AFTER dropping NaN, avoids potential float issues
-        df_filtered[year_col] = df_filtered[year_col].astype(int)
-    except Exception as e:
-        st.error(f"Could not convert year column '{year_col}' to numeric: {e}")
-        return df  # Return original on error
-    original_count = len(df_filtered)  # Count after potential NaNs are dropped
-    if start_year is None and end_year is None:
-        # No year filtering needed
-        return df_filtered
-    st.info(f"Filtering by years: Start={start_year}, End={end_year} on column '{year_col}'")
-    # Apply filters based on provided start/end years
-    if start_year is not None and end_year is not None:
-        # Specific range or single year (where start_year == end_year)
-        df_filtered = df_filtered[(df_filtered[year_col] >= start_year) & (df_filtered[year_col] <= end_year)]
-    elif start_year is not None:
-        # Only start year ("after X")
-        df_filtered = df_filtered[df_filtered[year_col] >= start_year]
-    elif end_year is not None:
-        # Only end year ("before Y")
-        df_filtered = df_filtered[df_filtered[year_col] <= end_year]
-    filtered_count = len(df_filtered)
-    if filtered_count == 0 and original_count > 0:  # Check if filtering removed all data
-        st.warning(f"No data found for the specified year(s): {start_year if start_year else ''}-{end_year if end_year else ''}")
-    elif filtered_count < original_count:
-        st.write(f"Filtered data by year. Rows reduced from {original_count} to {filtered_count}.")
-    return df_filtered
-# ---------------------------------------------------------------------------------
-# Load Model
-# ---------------------------------------------------------------------------------
-@st.cache_resource
-def load_gpt2():
-    try:
-        generator = pipeline('text-generation', model='openai-community/gpt2')
-        return generator
-    except Exception as e:
-        st.error(f"Failed to load GPT-2 model: {e}")
-        return None
-generator = load_gpt2()
-# ---------------------------------------------------------------------------------
-# Load Initial Data
-# ---------------------------------------------------------------------------------
-if 'data_labor' not in st.session_state:
-    st.session_state['data_labor'] = load_data("labor")  # Or your default table
-# ---------------------------------------------------------------------------------
-# Streamlit App UI Starts Here
-# ---------------------------------------------------------------------------------
-st.title("Análisis de Datos con GPT-2 y Visualización Automática")
-# Get the dataframe from session state
-df = st.session_state.get('data_labor')
-# --- Check if DataFrame is loaded ---
-if df is None or df.empty:
-    st.error("Failed to load data or data is empty. Please check Supabase connection and table 'labor'.")
-    # Optionally add a button to retry loading
-    if st.button("Retry Loading Data"):
-        st.session_state['data_labor'] = load_data("labor")
-        st.rerun()  # Rerun the script after attempting reload
-else:
-    # --- Section for the user question ---
-    st.subheader("Pregúntame algo sobre los datos de 'labor'")
-    question = st.text_input("Ejemplo: 'Cuál fue la fuerza laboral (labor force) en Germany entre 2010 y 2015?'")
-    if question:
-        # --- Main processing logic ---
-        st.write("--- Análisis de la pregunta ---")  # Debug separator
-        # Filter by Country
-        unique_countries = df['geo'].unique().tolist() if 'geo' in df.columns else []
-        extracted_country = extract_country_from_prompt_regex(question, unique_countries)
-        filtered_df = df.copy()
-        if extracted_country:
-            if 'geo' in filtered_df.columns:
-                filtered_df = filtered_df[filtered_df['geo'] == extracted_country]
-                st.success(f"Filtrando datos para el país: {extracted_country}")
-            else:
-                st.warning("Columna 'geo' no encontrada para filtrar por país.")
-        else:
-            st.info("No se especificó un país o no se encontró. Mostrando datos para todos los países disponibles.")
-        # Identify Columns
-        numerical_cols = [col for col in filtered_df.columns if pd.api.types.is_numeric_dtype(filtered_df[col])]
-        year_col_names = ['year', 'time', 'period', 'año']
-        year_cols = [col for col in filtered_df.columns if col.lower() in year_col_names and col in numerical_cols]
-        categorical_cols = [col for col in filtered_df.columns if pd.api.types.is_object_dtype(filtered_df[col]) and col != 'geo']
-        # Extract Years and Filter DataFrame
-        start_year, end_year = extract_years_from_prompt(question)
-        year_col_to_use = None
-        if year_cols:
-            year_col_to_use = year_cols[0]
-            filtered_df = filter_df_by_years(filtered_df, year_col_to_use, start_year, end_year)
-        else:
-            st.warning("No se pudo identificar una columna de año numérica para filtrar.")
-        # --- GPT-2 Description Generation ---
-        if generator:  # Check if model loaded successfully
-            st.subheader("Descripción Automática (GPT-2)")
-            # Create a concise context
-            context_description = "The dataset contains labor data"
-            context_info = f"Data for {extracted_country or 'all countries'}"
-            if extracted_country:
-                # If a specific country is filtered, mention it clearly
-                context_description += f" specifically for {extracted_country}"
-            else:
-                # Otherwise, mention the broader scope if known (e.g., Europe)
-                # If you load data for multiple countries by default, state that
-                context_description += " covering multiple countries"  # Adjust if needed
-            if year_col_to_use and (start_year is not None or end_year is not None):
-                context_info += f" between years {start_year if start_year else 'start'} and {end_year if end_year else 'end'}"
-            context_info += f". Columns include: {', '.join(filtered_df.columns.tolist())}."
-            prompt = f"{context_info}\n\nQuestion: {question}\nAnswer based ONLY on the provided context:"
-            try:
-                st.info("Generando descripción...")  # Let user know it's working
-                description = generator(prompt, max_new_tokens=200, num_return_sequences=1)[0]['generated_text']
-                # Clean up the output to show only the answer part
-                answer_part = description.split(prompt)[-1]  # Split by the prompt itself
-                st.success("Descripción generada:")
-                st.write(answer_part.strip())
-            except Exception as e:
-                st.error(f"Error generando descripción con GPT-2: {e}")
-        else:
-            st.warning("El modelo GPT-2 no está cargado. No se puede generar descripción.")
-        # --- Visualization Section ---
-        st.subheader("Visualización Automática")
-        if filtered_df.empty:
-            st.warning("No hay datos para mostrar después de aplicar los filtros.")
-        # --- Logic for LINE PLOT ---
-        elif year_col_to_use and numerical_cols:
-            start_time_graph = time.time()
-            potential_y_cols = [col for col in numerical_cols if col != year_col_to_use]
-            y_col = None
-            if not potential_y_cols:
-                st.warning(f"No se encontraron columnas numéricas de datos (aparte de '{year_col_to_use}') para graficar contra el año.")
-            else:
-                labor_keywords = ['labor', 'labour', 'workforce', 'employment', 'lfpr', 'fuerza']  # Added 'fuerza'
-                found_labor_col = False
-                for col in potential_y_cols:
-                    if any(keyword in col.lower() for keyword in labor_keywords):
-                        y_col = col
-                        st.info(f"Se encontró columna relevante: '{y_col}'. Usándola para el eje Y.")
-                        found_labor_col = True
-                        break
-                if not found_labor_col:
-                    y_col = potential_y_cols[0]
-                    st.info(f"No se encontró columna específica. Usando la primera columna numérica disponible ('{y_col}') para el eje Y.")
-            if y_col:
-                x_col = year_col_to_use
-                fig = go.Figure()
-                title = f"{y_col} vs {x_col}"
-                if extracted_country:
-                    title += f" en {extracted_country}"
-                if start_year is not None or end_year is not None:
-                    year_range_str = ""
-                    if start_year is not None:
-                        year_range_str += str(start_year)
-                    if end_year is not None:
-                        year_range_str += f"-{end_year}" if start_year is not None else str(end_year)
-                    if year_range_str:
-                        title += f" ({year_range_str})"
-                df_plot = filtered_df.sort_values(by=x_col)
-                if y_col in df_plot.columns and x_col in df_plot.columns:
-                    # Add color based on 'sex' if available
-                    if 'sex' in df_plot.columns:
-                        for sex_val in df_plot['sex'].unique():
-                            df_subset = df_plot[df_plot['sex'] == sex_val]
-                            fig.add_trace(go.Scatter(x=df_subset[x_col], y=df_subset[y_col], mode='lines+markers', name=str(sex_val)))
-                        fig.update_layout(title=title, xaxis_title=x_col, yaxis_title=y_col)
-                    else:
-                        fig.add_trace(go.Scatter(x=df_plot[x_col], y=df_plot[y_col], mode='lines+markers', name=y_col))
-                        fig.update_layout(title=title, xaxis_title=x_col, yaxis_title=y_col)
-                    st.plotly_chart(fig)
-                    end_time_graph = time.time()
-                    st.write(f"Gráfico generado en: {end_time_graph - start_time_graph:.4f} segundos")
-                else:
-                    st.warning("Las columnas X o Y seleccionadas no existen en los datos filtrados.")
-        # --- Logic for SCATTER PLOT ---
-        elif numerical_cols and len(numerical_cols) >= 2:
-            start_time_graph = time.time()
-            st.subheader("Gráfico de Dispersión Sugerido")
-            col1 = st.selectbox("Selecciona la primera columna numérica para el gráfico de dispersión:", numerical_cols)
-            col2 = st.selectbox("Selecciona la segunda columna numérica para el gráfico de dispersión:", [c for c in numerical_cols if c != col1])
-            if col1 and col2:
-                fig = px.scatter(filtered_df, x=col1, y=col2, title=f"Gráfico de Dispersión: {col1} vs {col2}")
-                st.plotly_chart(fig)
-                end_time_graph = time.time()
-                st.write(f"Gráfico generado en: {end_time_graph - start_time_graph:.4f} segundos")
-            else:
-                st.warning("Las columnas X o Y seleccionadas no existen en los datos filtrados.")
-        # --- Logic for SCATTER PLOT ---
-        # (Your scatter plot logic here...)
-        elif numerical_cols and len(numerical_cols) >= (2 + (1 if year_col_to_use else 0)) :
-             # ... (scatter plot code, ensuring cols exist) ...
-             pass # Placeholder
-        # --- Logic for BAR CHART ---
-        # (Your bar chart logic here...)
-        elif numerical_cols and categorical_cols:
-             # ... (bar chart code, ensuring cols exist and aggregating if needed) ...
-             pass # Placeholder
-        else:
-            # Only show this if no plots were generated above
-            if not (year_col_to_use and y_col): # Check if line plot was attempted
-                 st.info("No se encontraron columnas adecuadas o suficientes datos después del filtrado para generar un gráfico automáticamente.")

requirements.txt CHANGED Viewed

@@ -1,6 +1,14 @@
-pandas
-numpy
-streamlit
-python-dotenv
-matplotlib
-pandasai

+# supabase
+# transformers
+# matplotlib
+# pandas
+# torch
+# pandasai
+streamlit>=1.30.0
+pandas>=2.2.0
+numpy>=1.25.0
+python-dotenv>=1.0.0
+transformers>=4.37.2
+pandasai>=2.0.0
+supabase>=2.0.0