DVF Data¶
DVF (Demandes de Valeurs Foncières) is France's open dataset of real estate transactions. AppArt Agent uses this geolocalized data for price analysis and market trends.
Overview¶
| Metric | Value |
|---|---|
| Total Sales | 4.8M+ |
| Total Lots | 13.5M+ |
| Years Covered | 2015-2025 |
| Update Frequency | Quarterly |
| Source | data.gouv.fr |
| Geographic Coverage | Nationwide with lat/lon |
Data Source¶
Official DVF data is published by the French government:
- URL: https://www.data.gouv.fr/fr/datasets/demandes-de-valeurs-foncieres-geolocalisees/
- Format: CSV with geolocalized coordinates
- File Size: ~3.5GB uncompressed (~20M rows)
- Fields: 43 columns including address, price, surface area, property type, latitude, longitude
Key difference from original DVF: This is a pre-processed geolocalized version with:
- Computed lat/lon coordinates for each transaction
- Cleaned and normalized addresses
- Aggregated by unique transaction (id_mutation)
- One row per lot sold
New Schema (March 2026)¶
The project migrated from the old monolithic dvf_records table to a normalized two-table schema:
DVFSale (Transactions)¶
One row per real estate transaction (unique id_mutation).
class DVFSale(Base):
__tablename__ = "dvf_sales"
id: int # Primary key
id_mutation: str # Unique transaction ID (indexed)
date_mutation: date # Sale date
nature_mutation: str # Type: Vente, Échange, etc.
prix: float # Total transaction price
adresse_numero: str # Street number
adresse_nom_voie: str # Street name
adresse_complete: str # Computed full address (GIN indexed)
code_postal: str # Postal code (indexed)
code_commune: str # INSEE commune code
nom_commune: str # City name
code_departement: str # Department code
longitude: float # Longitude
latitude: float # Latitude
type_principal: str # Primary property type (Appartement, Maison, etc.)
surface_bati: float # Total built surface
nombre_pieces: int # Total rooms
surface_terrain: float # Land surface
nombre_lots: int # Number of lots in transaction
n_appartements: int # Number of apartments
n_maisons: int # Number of houses
n_dependances: int # Number of dependencies
n_parcelles_terrain: int # Number of land parcels
prix_m2: float # Price per sqm (computed, indexed)
annee: int # Year (extracted from date_mutation)
Indexes:
- Composite:
(code_postal, type_principal) - Composite:
(code_postal, type_principal, date_mutation)for trend queries - GIN trigram:
adresse_completefor fuzzy address matching - Partial:
prix_m2 WHERE prix_m2 IS NOT NULLfor price analysis
DVFSaleLot (Individual Lots)¶
One row per lot within a transaction (many-to-one with DVFSale).
class DVFSaleLot(Base):
__tablename__ = "dvf_sale_lots"
id: int # Primary key
id_mutation: str # Foreign key to DVFSale
lot_type: str # Type: Appartement, Maison, Dépendance, Terrain
nature_culture: str # Land type (for terrain lots)
surface_bati: float # Built surface of this lot
nombre_pieces: int # Rooms in this lot
surface_terrain: float # Land surface of this lot
id_parcelle: str # Cadastral parcel ID
longitude: float # Lot-specific longitude (if available)
latitude: float # Lot-specific latitude (if available)
Why two tables?
- One transaction (
id_mutation) can involve multiple lots (e.g., apartment + parking) - Each lot has its own type, surface, and sometimes coordinates
- Aggregations at transaction level (DVFSale) are much faster
- Detailed lot analysis available when needed (DVFSaleLot)
Data Import¶
CLI Tools¶
Two new CLI commands are available via uv run:
1. Download DVF¶
# Download the latest geolocalized DVF dataset
uv run download-dvf https://www.data.gouv.fr/fr/datasets/r/3004168d-bec4-44d9-a781-ef16f41856a2
# This will:
# - Download the CSV file (~3.5GB)
# - Extract to data/dvf/
# - Verify integrity
Entry point: backend/scripts/download_dvf.py (registered in root pyproject.toml)
2. Import DVF¶
# Import downloaded DVF data into PostgreSQL
uv run import-dvf
# Options:
# --source data/dvf/dvf_geolocalized.csv # Custom source file
# --limit 100000 # Import only first N rows (for testing)
Entry point: backend/scripts/import_dvf.py (registered in root pyproject.toml)
Import Process¶
The new importer uses Polars for fast CSV processing and PostgreSQL COPY for bulk inserts. COPY operations are done in 500k-row chunks with progress logging for visibility.
flowchart TD
A["1. Load CSV<br/>Polars reads ~20M rows<br/>~15GB in memory"] --> B
B["2. Group by id_mutation<br/>Aggregate lot-level data<br/>Compute transaction totals"] --> C
C["3. Create DVFSale rows<br/>~4.8M unique transactions<br/>Compute adresse_complete, prix_m2"] --> D
D["4. Bulk insert DVFSales<br/>Chunked COPY (500k rows/batch)<br/>~30s for 4.8M rows"] --> E
E["5. Create DVFSaleLot rows<br/>~13.5M individual lots<br/>Preserve lot details"] --> F
F["6. Bulk insert DVFSaleLots<br/>Chunked COPY (500k rows/batch)<br/>~25s for 13.5M rows"]
Performance¶
| Environment | Total Import Time | Notes |
|---|---|---|
| Local | ~55 seconds | Polars + COPY, no download time |
| Production (Cloud Run) | ~25 minutes | Includes GCS download, chunked COPY to Cloud SQL |
| Metric | Value |
|---|---|
| DVFSale Insert | ~30 seconds (4.8M rows) |
| DVFSaleLot Insert | ~25 seconds (13.5M rows) |
| Memory Usage | ~15GB peak (Polars groupby) |
| CPU | 4+ vCPU recommended |
Why so fast?
- Polars: columnar processing, parallel execution, lazy evaluation
- COPY FROM STDIN: PostgreSQL bulk insert protocol (50x faster than INSERT)
- No row-by-row processing: bulk operations only
Street Address Normalization¶
The importer normalizes street addresses to match DVF abbreviations:
STREET_TYPE_MAPPING = {
"BOULEVARD": "BD",
"AVENUE": "AV",
"RUE": "RUE",
"PLACE": "PL",
"IMPASSE": "IMP",
"CHEMIN": "CHE",
"ALLEE": "ALL",
# ... etc
}
This enables accurate address matching in DVFService.get_exact_address_sales().
GCS URI Support¶
The import script supports gs:// URIs for same-region downloads, which bypass the public internet and use Google Cloud's internal network for faster, more reliable transfers.
# Local file
uv run import-dvf --source data/dvf/dvf_geolocalized.csv
# GCS URI (production)
export DVF_SOURCE_URL=gs://my-bucket/dvf_geolocalized.csv
uv run import-dvf
DVF Service¶
The DVFService class provides high-level methods for price analysis:
Methods¶
| Method | Purpose |
|---|---|
get_exact_address_sales() |
Find historical sales at exact address |
get_comparable_sales() |
Find similar properties in same postal code |
get_neighboring_sales_for_trend() |
Get recent sales in area for trend calculation |
calculate_market_trend() |
Compute monthly price trend from neighboring sales |
calculate_trend_based_projection() |
Project future price using trend |
calculate_price_analysis() |
Full analysis: exact + comparable + trend projection |
Address Autocomplete¶
The frontend's address search field uses api-adresse.data.gouv.fr (French government geocoding API) for instant autocomplete suggestions. This provides:
- Real-time address suggestions as the user types
- Standardized addresses with postal codes
- Geographic coordinates for mapping
DVF matching happens at price analysis time via:
- Postal code (from selected address)
- Normalized street name (using
STREET_TYPE_MAPPING)
This two-step approach ensures fast autocomplete UX while maintaining accurate DVF matching.
Example: Price Analysis¶
from app.services.dvf_service import DVFService
service = DVFService(db)
result = service.calculate_price_analysis(
property_id=123,
address="56 RUE NOTRE-DAME DES CHAMPS",
postal_code="75006",
surface_area=65.5,
rooms=3,
property_type="Appartement"
)
# Returns:
{
"exact_match_sales": [...], # Sales at same address
"comparable_sales": [...], # Similar properties in 75006
"excluded_sale_ids": [1, 2, 3], # Outliers removed by IQR
"market_trend_monthly": 0.008, # 0.8% monthly growth
"market_trend_annual": 0.096, # 9.6% annual growth
"projected_price": 875000, # Estimated value
"projected_price_m2": 13360,
"confidence": "medium",
"excluded_neighboring_sale_ids": [4, 5]
}
IQR Outlier Filtering¶
All analyses use IQR (Interquartile Range) to remove outliers:
def filter_outliers(values: List[float]) -> List[float]:
q1 = np.percentile(values, 25)
q3 = np.percentile(values, 75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
return [v for v in values if lower <= v <= upper]
Outliers are tracked in excluded_sale_ids and excluded_neighboring_sale_ids.
Price Analysis Table¶
Results are cached in the price_analyses table:
class PriceAnalysis(Base):
__tablename__ = "price_analyses"
id: int
property_id: int # Foreign key to properties
analysis_data: JSON # Full DVFService result
excluded_sale_ids: JSON # Outlier IDs removed
excluded_neighboring_sale_ids: JSON # Trend outlier IDs removed
created_at: datetime
updated_at: datetime
This avoids re-computing expensive analyses and preserves historical snapshots.
Production Deployment (GCP Cloud Run Job)¶
Cloud Run Job: dvf-import¶
The import runs as a Cloud Run Job (not a service) to avoid timeout limits.
The job is defined in Terraform (infra/terraform/main.tf) and triggered manually.
# Execute the import
gcloud run jobs execute dvf-import --region europe-west1 --wait
# Check logs
gcloud run jobs executions logs dvf-import --region europe-west1
Configuration:
- Memory: 32 GiB
- CPU: 8 vCPU (Polars parallelism)
- Timeout: 60 minutes
- Max retries: 0 (fail fast for easier debugging)
- VPC egress:
PRIVATE_RANGES_ONLY— Cloud SQL traffic goes through the VPC connector, public downloads (data.gouv.fr) bypass the VPC directly to the internet - Environment:
PYTHONUNBUFFERED=1for real-time log visibility - Trigger: Manual (via GitHub Actions workflow or
gcloudcommand)
GitHub Actions Workflows¶
1. dvf-import.yml¶
Manual workflow to trigger DVF import. Optionally accepts a custom source URL.
2. deploy.yml¶
Updates the DVF job image on each deploy to main so the next import
execution uses the latest code.
Query Examples¶
Find Sales at Address¶
from app.services.dvf_service import DVFService
service = DVFService(db)
sales = service.get_exact_address_sales(
address="56 RUE NOTRE-DAME DES CHAMPS",
postal_code="75006"
)
for sale in sales:
print(f"{sale.date_mutation}: {sale.prix:,.0f} EUR ({sale.prix_m2:,.0f} EUR/m²)")
Comparable Sales (Same Postal Code + Type)¶
comparables = service.get_comparable_sales(
postal_code="75006",
property_type="Appartement",
surface_area=65.5,
rooms=3,
max_results=50
)
# Filters by:
# - Same postal code
# - Same property type
# - Surface within 20% (52.4-78.6 m²)
# - Rooms within ±1 (2-4 rooms)
# - Last 2 years
# - Sorted by date descending
Market Trend (Neighboring Sales)¶
trend = service.calculate_market_trend(
postal_code="75006",
property_type="Appartement",
surface_area=65.5,
rooms=3
)
print(f"Monthly trend: {trend['monthly_change']:.2%}")
print(f"Annual trend: {trend['annual_change']:.2%}")
Data Validation¶
Check Import Completeness¶
# Total sales
docker-compose exec db psql -U appart -d appart_agent -c "
SELECT COUNT(*) as total_sales FROM dvf_sales;
"
# Total lots
docker-compose exec db psql -U appart -d appart_agent -c "
SELECT COUNT(*) as total_lots FROM dvf_sale_lots;
"
# Sales by year
docker-compose exec db psql -U appart -d appart_agent -c "
SELECT annee, COUNT(*) as count
FROM dvf_sales
GROUP BY annee
ORDER BY annee;
"
Check Index Usage¶
docker-compose exec db psql -U appart -d appart_agent -c "
SELECT
schemaname,
tablename,
indexname,
idx_scan as times_used
FROM pg_stat_user_indexes
WHERE tablename IN ('dvf_sales', 'dvf_sale_lots')
ORDER BY idx_scan DESC;
"
Test Address Search¶
from app.core.database import SessionLocal
from app.models.property import DVFSale
from sqlalchemy import func
db = SessionLocal()
# Fuzzy search using trigram similarity
sales = db.query(DVFSale).filter(
DVFSale.code_postal == "75006",
func.similarity(DVFSale.adresse_complete, "56 RUE NOTRE DAME CHAMPS") > 0.3
).order_by(
func.similarity(DVFSale.adresse_complete, "56 RUE NOTRE DAME CHAMPS").desc()
).limit(10).all()
for sale in sales:
print(f"{sale.adresse_complete} ({sale.date_mutation}): {sale.prix:,.0f} EUR")
db.close()
Troubleshooting¶
Missing Sales¶
Issue: Address search returns no results.
Solutions:
- Check if postal code exists:
- Check address normalization:
from app.services.dvf_service import normalize_street
input_addr = "56 Boulevard Notre-Dame des Champs"
normalized = normalize_street(input_addr)
print(f"Searching for: {normalized}") # "56 BD NOTRE DAME DES CHAMPS"
- Use fuzzy search instead of exact match:
# Exact match (may fail due to typos)
sales = db.query(DVFSale).filter_by(
adresse_complete="56 RUE NOTRE DAME CHAMPS",
code_postal="75006"
).all()
# Fuzzy match (more forgiving)
sales = db.query(DVFSale).filter(
DVFSale.code_postal == "75006",
func.similarity(DVFSale.adresse_complete, "56 RUE NOTRE DAME CHAMPS") > 0.3
).all()
Slow Queries¶
Issue: Address search takes too long (>1s).
Solutions:
- Verify GIN index exists:
- Analyze table statistics:
- Check query plan:
EXPLAIN ANALYZE
SELECT * FROM dvf_sales
WHERE code_postal = '75006'
AND similarity(adresse_complete, '56 RUE NOTRE DAME CHAMPS') > 0.3
ORDER BY similarity(adresse_complete, '56 RUE NOTRE DAME CHAMPS') DESC
LIMIT 50;
Expected: Index Scan using idx_dvf_sales_composite or idx_dvf_sales_address_gin
Import Failures¶
Issue: Import crashes or runs out of memory.
Solutions:
- Increase available memory (15GB+ recommended)
- Use
--limitflag for testing:
- Check disk space for PostgreSQL:
- Check PostgreSQL logs:
Migration from Old Schema¶
The migration from dvf_records → dvf_sales + dvf_sale_lots is handled by Alembic:
- Migration file:
backend/alembic/versions/l3m4n5o6p7q8_migrate_dvf_to_geolocalized_schema.py - Drops:
dvf_records,dvf_imports,dvf_stats,dvf_grouped_transactionsview - Creates:
dvf_sales,dvf_sale_lots, indexes
Run migration:
No data is automatically migrated. After running the migration, you must re-import DVF data using uv run import-dvf.