# Community Contribution: Async Playwright-based OpenAI Scraper

This contribution presents a fully asynchronous, headless-browser-based scraper for [https://openai.com](https://openai.com) using **Playwright**, an alternative to Selenium.

Developed by: [lakovicb](https://github.com/lakovicb)

IDE used: WingIDE Pro (Jupyter compatibility via `nest_asyncio`)

---

## Features

- Simulates human-like interactions (mouse movement, scrolling)
- GPT-based analysis using OpenAI's API
- Works inside **JupyterLab** using `nest_asyncio`
- Prometheus metrics for scraping observability
- Smart content caching via `diskcache`
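
The first feature can be illustrated with a minimal sketch. It assumes a Playwright `page` object from the async API and relies only on `page.mouse.move`, `page.mouse.wheel`, and `page.wait_for_timeout`. The function name `simulate_human`, its parameters, and the timing ranges are hypothetical and are not taken from the project's notebooks.

```python
import asyncio
import random


async def simulate_human(page, moves=5, scrolls=3, width=1280, height=720, rng=None):
    """Issue a few randomized mouse moves and wheel scrolls on a Playwright page.

    `page` is expected to behave like playwright.async_api.Page.
    All defaults here are illustrative assumptions.
    """
    rng = rng or random.Random()

    for _ in range(moves):
        x = rng.uniform(0, width)
        y = rng.uniform(0, height)
        # steps > 1 asks Playwright to interpolate intermediate mouse positions
        await page.mouse.move(x, y, steps=rng.randint(5, 20))
        # Short, irregular pause between movements (milliseconds)
        await page.wait_for_timeout(rng.uniform(100, 400))

    for _ in range(scrolls):
        # Scroll down by a random vertical distance in pixels
        await page.mouse.wheel(0, rng.randint(200, 600))
        await page.wait_for_timeout(rng.uniform(300, 900))
```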

---

## How to Run

### 1. Install dependencies

```bash
pip install -r requirements.txt
```

> Ensure [Playwright is installed & browsers are downloaded](https://playwright.dev/python/docs/intro)

```bash
playwright install
```

### 2. Set environment variables in `.env`

```env
OPENAI_API_KEY=your_openai_key
BROWSER_PATH=/usr/bin/chromium-browser
```

You can also define optional proxy/login params if needed.
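
One way the two variables above might be read at startup is sketched here with the standard library only. The helper names `load_env_file` and `get_settings` are hypothetical, and treating `BROWSER_PATH` as optional is an assumption. The project itself may use a package such as `python-dotenv` instead.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file into a dict.

    Blank lines, comment lines, and lines without '=' are skipped.
    """
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_settings(path=".env"):
    """Merge .env values with the process environment; real env vars win."""
    merged = {**load_env_file(path), **os.environ}
    api_key = merged.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {
        "api_key": api_key,
        # Assumption: BROWSER_PATH is optional. A value of None could be passed
        # to Playwright's launch(executable_path=...) to use its bundled browser.
        "browser_path": merged.get("BROWSER_PATH"),
    }
```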

---

## Notebooks Included

| Notebook | Description |
|----------|-------------|
| `Playwright_Solution_JupyterAsync.ipynb` | Executes the async scraper directly inside Jupyter |
| `Playwright_Solution_Showcase_Formatted.ipynb` | Nicely formatted output for human reading |
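
Running an async scraper from a notebook cell usually means dealing with Jupyter's already-running event loop, which is where `nest_asyncio` comes in. The sketch below shows one possible pattern; `scrape_stub` is a hypothetical stand-in for the real scraper coroutine and `run_sync` is an illustrative helper, neither taken from the notebooks. Recent Jupyter versions also accept a top-level `await` in a cell, which avoids the patch entirely.

```python
import asyncio


async def scrape_stub(url):
    """Hypothetical stand-in for the notebook's scraper coroutine."""
    await asyncio.sleep(0)  # yield control once, as real I/O would
    return {"url": url, "status": "ok"}


def run_sync(coro):
    """Run a coroutine to completion from synchronous code.

    In a plain script no loop is running, so asyncio.run is enough.
    Inside Jupyter a loop is already running, so the loop is patched
    with nest_asyncio to allow a nested run_until_complete call.
    """
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(coro)

    # Third-party dependency, imported only when actually needed
    import nest_asyncio

    nest_asyncio.apply(loop)
    return loop.run_until_complete(coro)
```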

---

## Output Example

- GPT-generated summary
- Timeline of updates
- Entities and projects mentioned
- Structured topics & themes

*Can be extended with PDF export, a LangChain pipeline, or vector store ingestion.*
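
For downstream steps such as export or ingestion, the four output parts listed above could be held in a small container. This is only a sketch: the class `AnalysisResult`, its field names, and the JSON shape are hypothetical, since the README does not specify how the notebooks structure the GPT response.

```python
import json
from dataclasses import dataclass, field


@dataclass
class AnalysisResult:
    """Hypothetical container mirroring the four output parts listed above."""

    summary: str = ""
    timeline: list = field(default_factory=list)
    entities: list = field(default_factory=list)
    topics: list = field(default_factory=list)

    @classmethod
    def from_json(cls, text):
        """Build a result from a JSON object; missing keys fall back to empty values."""
        data = json.loads(text)
        return cls(
            summary=data.get("summary", ""),
            timeline=list(data.get("timeline", [])),
            entities=list(data.get("entities", [])),
            topics=list(data.get("topics", [])),
        )

    def to_markdown(self):
        """Render the result as Markdown with one section per part."""
        sections = [
            ("Timeline", self.timeline),
            ("Entities", self.entities),
            ("Topics", self.topics),
        ]
        lines = ["## Summary", self.summary]
        for title, items in sections:
            lines += ["", f"## {title}"]
            lines += [f"- {item}" for item in items]
        return "\n".join(lines)
```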

---

## Thanks

Huge thanks to Ed Donner for the amazing course and challenge inspiration!