How to run some of the code in this repository
1. Make sure Docker is installed on your machine
2. Clone the repository
3. CD into the repository
4. Run the following command to build the docker image
docker build -t oc-prototype .
5. Run the following commands to start the container in the background and open a shell inside it
docker compose up -d oc-prototype
docker exec -it oc-prototype /bin/bash
Prototype TODOs
Data
- Process all misinfo claims and generate embeddings for a library namespace
- Upsert claims into Pinecone
- Upsert 300k claims into the namespace
- Update claim format to be similar to: https://www.kaggle.com/datasets/shivkumarganesh/politifact-factcheck-data/data
Functions
Upsert vector
Batch upsert
Query against metadata
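A minimal sketch of the "Batch upsert" item, assuming the records are already prepared as a list. The names `chunked`, `batch_upsert`, and `upsert_fn` are hypothetical: `upsert_fn` stands in for whatever call actually writes one batch to the vector store (for example a thin wrapper around the Pinecone client), and the batch size of 100 is only a placeholder.

```python
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def batch_upsert(
    records: Sequence[T],
    upsert_fn: Callable[[Sequence[T]], object],
    batch_size: int = 100,
) -> int:
    """Pass `records` to `upsert_fn` one batch at a time.

    `upsert_fn` is a hypothetical callable that writes a single batch
    to the vector store. Returns the number of batches sent.
    """
    sent = 0
    for batch in chunked(records, batch_size):
        upsert_fn(batch)
        sent += 1
    return sent
```

Keeping the store-specific call behind `upsert_fn` means the batching logic can be checked without network access, for example by passing `some_list.append` as the callable.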
Generate working Dockerfile for project reproducibility
Load data into a database
Test precision/recall of embeddings
Generate working version of climate demo
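For the "Test precision/recall of embeddings" item, one possible starting point is precision and recall at k for a single query. This is a sketch under assumptions: the function name `precision_recall_at_k` is hypothetical, `retrieved` is assumed to be a ranked list of unique claim IDs returned by a query, and `relevant` is assumed to be a hand-labelled set of IDs that should match.

```python
from typing import Hashable, Sequence, Set, Tuple


def precision_recall_at_k(
    retrieved: Sequence[Hashable],
    relevant: Set[Hashable],
    k: int,
) -> Tuple[float, float]:
    """Return (precision@k, recall@k) for one query.

    precision@k = relevant hits in the top k / k
    recall@k    = relevant hits in the top k / number of relevant items
    Assumes `retrieved` is ranked best-first and contains no duplicates.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Averaging these two numbers over a set of labelled queries gives a rough picture of retrieval quality for a given embedding model.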
Embedding pricing:
Price: $0.100 / 1M tokens, i.e. $0.0001 per 1k tokens.
1 token is approximately 0.75 words, so 1k tokens is about 750 words.
1 token is also about 4 characters, so 1k tokens is about 4 KB of text to embed.
On that basis, the cost of embedding can be approximated by: Cost in $ = Size of Data in Kilobytes * 0.000025
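The rule of thumb above can be written as a small helper. This is a rough estimate only: the constant and function names are hypothetical, 1 KB is taken as 1000 characters, and the price and the 4-characters-per-token ratio are the figures quoted in this note.

```python
# Price quoted in this note: $0.100 per 1M tokens = $0.0001 per 1k tokens.
PRICE_PER_1K_TOKENS_USD = 0.0001
# Rule of thumb quoted in this note: about 4 characters of text per token.
CHARS_PER_TOKEN = 4


def estimate_embedding_cost_usd(size_kb: float) -> float:
    """Approximate the cost in dollars of embedding `size_kb` kilobytes of text."""
    tokens = size_kb * 1000 / CHARS_PER_TOKEN  # 1 KB is roughly 250 tokens
    return tokens / 1000 * PRICE_PER_1K_TOKENS_USD
```

With these figures, 1 KB costs about $0.000025, so roughly 1 GB of text (1,000,000 KB) comes to about $25.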
Credentials for running Google Cloud queries: see ostreacultura-credentials.json