supabase/apps/docs/content/guides/ai/quickstarts/text-deduplication.mdx
Charis 47705a8968 chore: replace all supabase urls with relative urls (#38537)
* fix: rewrite relative URLs when syncing to GitHub discussion

Relative URLs back to supabase.com won't work in GitHub discussions, so
rewrite them back to absolute URLs starting with https://supabase.com

* fix: replace all supabase urls with relative urls

* chore: add linting for relative urls

* chore: bump linter version

* Prettier

---------

Co-authored-by: Chris Chinchilla <chris.ward@supabase.io>
2025-09-09 12:54:33 +00:00


---
id: 'ai-vecs-python-client'
title: 'Semantic Text Deduplication'
subtitle: 'Finding duplicate movie reviews with Supabase Vecs.'
breadcrumb: 'AI Quickstarts'
---
This guide walks you through a ["Semantic Text Deduplication"](https://github.com/supabase/supabase/blob/master/examples/ai/semantic_text_deduplication.ipynb) example using Colab and Supabase Vecs. You'll learn how to find similar movie reviews using embeddings and remove any that look like duplicates. You will:
1. Launch a Postgres database that uses pgvector to store embeddings
1. Launch a notebook that connects to your database
1. Load the IMDB dataset
1. Use the `sentence-transformers/all-MiniLM-L6-v2` model to create an embedding representing the semantic meaning of each review
1. Search for all duplicates
<$Partial path="database_setup.mdx" />
## Launching a notebook
Launch our [`semantic_text_deduplication`](https://github.com/supabase/supabase/blob/master/examples/ai/semantic_text_deduplication.ipynb) notebook in Colab:
<a
className="w-64"
href="https://colab.research.google.com/github/supabase/supabase/blob/master/examples/ai/semantic_text_deduplication.ipynb"
>
<img src="/docs/img/ai/colab-badge.svg" />
</a>
At the top of the notebook, you'll see a `Copy to Drive` button. Click it to copy the notebook to your Google Drive.
## Connecting to your database
Inside the notebook, find the cell that specifies `DB_CONNECTION`. It contains code like this:
```python
import vecs
DB_CONNECTION = "postgresql://<user>:<password>@<host>:<port>/<db_name>"
# create vector store client
vx = vecs.create_client(DB_CONNECTION)
```
Replace the value of `DB_CONNECTION` with your own connection string. You can find the connection string on your project dashboard by clicking [Connect](/dashboard/project/_?showConnect=true).
<Admonition type='note'>
SQLAlchemy requires the connection string to start with `postgresql://` (instead of `postgres://`). Remember to change the prefix after copying the string from the dashboard.
</Admonition>
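If you would rather not edit the string by hand, the prefix change can be done in code. This is a minimal sketch: the helper name `to_sqlalchemy_url` and the placeholder connection string are assumptions for illustration and are not part of the notebook or of Vecs.

```python
def to_sqlalchemy_url(connection_string: str) -> str:
    """Return the connection string with a `postgresql://` scheme.

    Hypothetical helper: strings that already use `postgresql://`
    (or any other scheme) are returned unchanged.
    """
    legacy_prefix = "postgres://"
    if connection_string.startswith(legacy_prefix):
        return "postgresql://" + connection_string[len(legacy_prefix):]
    return connection_string


# Placeholder value; substitute the string copied from your dashboard.
DB_CONNECTION = to_sqlalchemy_url("postgres://user:password@host:5432/postgres")
```

You can then pass `DB_CONNECTION` to `vecs.create_client` as in the cell above.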
<Admonition type='note'>
You must use the "connection pooling" string (domain ending in `*.pooler.supabase.com`) with Google Colab since Colab does not support IPv6.
</Admonition>
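A quick check before connecting can catch a non-pooler string early. The sketch below uses only the Python standard library; the function name `uses_pooler` and both example hosts are made-up placeholders, and the parsing assumes any special characters in the password are URL-encoded.

```python
from urllib.parse import urlsplit


def uses_pooler(connection_string: str) -> bool:
    """Return True if the host ends in `.pooler.supabase.com`.

    Hypothetical helper for illustration; it only inspects the hostname
    and does not attempt to connect.
    """
    host = urlsplit(connection_string).hostname or ""
    return host.endswith(".pooler.supabase.com")


# Placeholder hosts, not real projects.
pooled = "postgresql://user:password@example.pooler.supabase.com:6543/postgres"
direct = "postgresql://user:password@db.example.supabase.co:5432/postgres"

pooled_ok = uses_pooler(pooled)  # True for the pooler domain
direct_ok = uses_pooler(direct)  # False for any other domain
```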
## Stepping through the notebook
Now all that's left is to step through the notebook. Run each code cell by clicking the "execute" button at the top left of the cell, or by pressing `ctrl+enter`. The notebook guides you through the process of creating a collection, adding data to it, and querying it.
You can view the inserted items in the [Table Editor](/dashboard/project/_/editor/) by selecting the `vecs` schema from the schema dropdown.
![Colab documents](/docs/img/ai/google-colab/colab-documents.png)
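As a rough mental model of the duplicate search, two reviews can be treated as duplicates when their embeddings are nearly parallel. The sketch below is not the notebook's code: it is a brute-force, pure-Python illustration that uses toy 3-dimensional vectors in place of the 384-dimensional embeddings produced by `all-MiniLM-L6-v2`, and the `0.95` threshold and the names `cosine_similarity` and `find_duplicates` are assumptions chosen for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_duplicates(embeddings, threshold=0.95):
    """Return (id_a, id_b) pairs whose cosine similarity is >= threshold.

    Compares every pair once, so it is only suitable for small inputs.
    """
    ids = sorted(embeddings)
    pairs = []
    for i, id_a in enumerate(ids):
        for id_b in ids[i + 1:]:
            if cosine_similarity(embeddings[id_a], embeddings[id_b]) >= threshold:
                pairs.append((id_a, id_b))
    return pairs


# Toy vectors standing in for real review embeddings (assumed values).
toy_embeddings = {
    "review_1": [0.9, 0.1, 0.0],
    "review_2": [0.89, 0.12, 0.01],
    "review_3": [0.0, 0.2, 0.95],
}

duplicate_pairs = find_duplicates(toy_embeddings)
```

Here `review_1` and `review_2` point in almost the same direction and are flagged, while `review_3` is not. Comparing every pair in a Python loop scales quadratically, which is the motivation for storing embeddings in a vector collection and letting the database run the similarity query instead.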
<$Partial path="ai/quickstart_hf_deployment.mdx" />
## Next steps
You can now start building your own applications with Vecs. Check our [examples](/docs/guides/ai#examples) for ideas.