## How to run some of the code in this repository
### 1. Make sure Docker is installed on your machine
### 2. Clone the repository
### 3. CD into the repository
### 4. Run the following command to build the docker image
```bash
docker compose build oc-prototype
```
### 5. Run the following commands to start the container and open a shell inside it
```bash
docker compose up -d oc-prototype
docker compose exec oc-prototype /bin/bash
```
## Prototype TODOs
### Data
- [X] Process all misinfo claims and generate embeddings for a library namespace
- [X] Upsert claims into Pinecone
- [X] Upsert 300k into namespace
- [ ] Update claim format to be similar to: https://www.kaggle.com/datasets/shivkumarganesh/politifact-factcheck-data/data
### Functions
- [X] Upsert vector
- [X] Batch upsert
- [X] Query against metadata
- [ ] Generate working Dockerfile for project reproducibility
- [ ] Load data into a database
- [ ] Test precision/recall of embeddings
- [ ] Generate working version of climate demo
## Embedding pricing
- One token is approximately 0.75 words, so 1k tokens is about 750 words.
- Embeddings are billed at $0.0001 per 1k tokens, which is $0.10 per 1M tokens.
- One token is also about 4 characters, so 1k tokens is roughly 4 KB of text.

On that basis, the embedding cost can be approximated from the size of the input text, since $0.0001 / 4 KB = $0.000025 per KB:

Cost in $ = Size of data in kilobytes * 0.000025
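
The arithmetic above can be sketched as a small helper. This is only a rough estimate under the figures quoted in this section (about 4 characters per token, $0.0001 per 1k tokens). The function and constant names are hypothetical and not part of this repository, and 1 KB is taken to be 1000 characters of plain text.

```python
# Rough embedding-cost estimate using the figures quoted in this section.
# Assumptions (not measured): ~4 characters per token, $0.0001 per 1k tokens.
CHARS_PER_TOKEN = 4
USD_PER_1K_TOKENS = 0.0001


def estimate_tokens(text: str) -> float:
    """Approximate the token count of `text` from its character count."""
    return len(text) / CHARS_PER_TOKEN


def estimate_cost_usd(text: str) -> float:
    """Approximate the cost in US dollars of embedding `text`."""
    return estimate_tokens(text) / 1000 * USD_PER_1K_TOKENS


def estimate_cost_from_kb(size_kb: float) -> float:
    """Same estimate from a data size in kilobytes (1 KB ~ 1000 characters)."""
    # $0.0001 per 4 KB of text works out to $0.000025 per KB.
    return size_kb * USD_PER_1K_TOKENS / CHARS_PER_TOKEN
```

For example, `estimate_cost_from_kb(4000)` covers about 1M tokens and comes out at roughly $0.10, matching the per-million rate. Actual token counts depend on the tokenizer and the language of the text, so treat the result as an order-of-magnitude figure.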
Credentials for running Google Cloud queries: see ostreacultura-credentials.json