Welcome to F97.BE!

Python, Photons, and Predictions.

How the AI Forecast Was Created

This section explains, step by step, how the AI-based forecast for daily solar energy production is created. It combines real production data from the solar inverter with weather data using machine learning, resulting in more accurate predictions than traditional physics-based models alone. Below you'll find the entire process – from raw data to final forecast – illustrated with examples.

1. Data Collection: Where the Model Gets Its Information

To make accurate predictions, the model relies on a robust dataset built from over 1,300 consecutive days of historical information. This data is stored in a CSV file named energy_weather_dataset.csv, with each row representing a single calendar day. For every entry, the dataset combines measured solar energy production with the corresponding weather conditions on that day.

Since the installation of the solar inverter in 2021, I have automated the download and logging of all available inverter data. A scheduled script runs daily on a local system to collect production data directly from the inverter and append it to a growing historical log. This automation ensures that the dataset remains up to date with minimal manual intervention.

To enrich the dataset further, I integrated historical weather data by polling daily values from the Open-Meteo API. These weather parameters were then matched to each corresponding day and appended to the CSV file, providing comprehensive environmental context for each energy reading.
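The polling script itself isn't reproduced here, but the daily merge step can be sketched as follows. This is a minimal illustration, assuming the weather values have already been parsed from the API response into a plain dictionary; the variable names and the in-memory buffer are assumptions for the sketch, not the actual script.

```python
import csv
import io

# Hypothetical, already-parsed daily weather record (illustrative values)
weather = {
    "date": "2023-05-01",
    "temp_max": 21.3, "temp_min": 11.2,
    "cloud_cover": 0.2, "precipitation": 0.0,
    "wind_speed": 5.3, "radiation": 6.7,
}
energy_kwh = 38.45  # measured production taken from the inverter log

# Append one combined row, matching the CSV column order of the dataset
columns = ["date", "energy_kwh", "temp_max", "temp_min",
           "cloud_cover", "precipitation", "wind_speed", "radiation"]
row = {**weather, "energy_kwh": energy_kwh}

buf = io.StringIO()  # stand-in for open("energy_weather_dataset.csv", "a")
writer = csv.DictWriter(buf, fieldnames=columns)
writer.writerow(row)
print(buf.getvalue().strip())  # → 2023-05-01,38.45,21.3,11.2,0.2,0.0,5.3,6.7
```

In the real pipeline the row is appended to the growing CSV file rather than an in-memory buffer.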

Here’s a sample excerpt from the dataset:


          date,energy_kwh,temp_max,temp_min,cloud_cover,precipitation,wind_speed,radiation
          2023-05-01,38.45,21.3,11.2,0.2,0.0,5.3,6.7
          2023-05-02,33.22,19.8,10.0,0.6,0.0,6.1,5.8
          2023-05-03,16.10,14.0,8.9,0.9,1.2,4.2,2.4
          

The key column is energy_kwh, which records the actual amount of energy (in kilowatt-hours) produced by the solar installation on that day. The remaining columns capture a variety of meteorological factors that influence solar output:

  • temp_max and temp_min: The daily maximum and minimum temperatures, in degrees Celsius.
  • cloud_cover: A normalized value from 0 (clear skies) to 1 (fully overcast), indicating the extent of cloudiness.
  • precipitation: The amount of rainfall in millimeters.
  • wind_speed: The daily average wind speed, in meters per second.
  • radiation: The total solar radiation received, measured in kilowatt-hours per square meter.

By correlating energy output with these weather parameters, the model learns how environmental conditions affect daily solar production—and can apply that knowledge to generate accurate future forecasts.

2. Feature Engineering: Turning Raw Data Into Model Inputs

Before the dataset can be used to train a machine learning model, it undergoes a critical preprocessing step known as feature engineering. This process transforms raw values into a structured format that the model can interpret and learn from effectively.

Each row in the dataset is converted into a feature vector—a numeric array containing selected weather attributes for that day. This vector serves as the input (X) for the model, while the corresponding energy production (y) becomes the target output.


          X = [21.3, 11.2, 0.2, 0.0, 5.3, 6.7]  →  y = 38.45 kWh
                ↑     ↑     ↑    ↑    ↑    ↑
              Tmax  Tmin   CC Precip Wind Radiation
          

Here’s what each feature represents:

  • Tmax: Maximum temperature of the day (°C)
  • Tmin: Minimum temperature of the day (°C)
  • CC (Cloud Cover): Fraction of the sky covered by clouds (0 = clear, 1 = overcast)
  • Precip: Precipitation amount in millimeters
  • Wind: Average wind speed (m/s)
  • Radiation: Total solar radiation received (kWh/m²)
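In code, the mapping from one dataset row to a feature vector is a few lines. The helper name `build_features` is an assumption for this sketch; what matters is that the column order stays fixed between training and prediction:

```python
# Order must match the training layout: Tmax, Tmin, CC, Precip, Wind, Radiation
FEATURE_ORDER = ["temp_max", "temp_min", "cloud_cover",
                 "precipitation", "wind_speed", "radiation"]

def build_features(row: dict) -> list[float]:
    """Turn one day's weather record into the model's input vector."""
    return [float(row[name]) for name in FEATURE_ORDER]

day = {"date": "2023-05-01", "energy_kwh": 38.45,
       "temp_max": 21.3, "temp_min": 11.2, "cloud_cover": 0.2,
       "precipitation": 0.0, "wind_speed": 5.3, "radiation": 6.7}

print(build_features(day))  # → [21.3, 11.2, 0.2, 0.0, 5.3, 6.7]
```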

To improve the model’s ability to detect patterns over time, temporal features may also be added. For example, the day_of_year can be included to give the model a sense of seasonal progression. To better handle the cyclical nature of seasons, the day number is often transformed using a cosine function:


          seasonal = cos(2π * day_of_year / 365)
          

This transformation ensures that days like January 1st and December 31st—though numerically distant—are recognized as seasonally similar by the model.
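A quick check of that property, using the formula above:

```python
import math

def seasonal(day_of_year: int) -> float:
    """Cosine encoding of the day of year (1..365)."""
    return math.cos(2 * math.pi * day_of_year / 365)

# January 1st and December 31st land at nearly the same point on the cycle,
# while midsummer (around day 182) sits at the opposite extreme.
print(round(seasonal(1), 3), round(seasonal(365), 3), round(seasonal(182), 3))
```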

These engineered features form the core of the input space used for training. They allow the model to learn how different weather conditions—and their seasonal variations—affect solar energy production.

3. Model Evaluation: Why Random Forest and Gradient Boosting?

Early in the project, I experimented with a wide range of machine learning models to predict solar energy output based on historical weather and inverter data. These included:

  • Linear Regression
  • Support Vector Regression (SVR)
  • k-Nearest Neighbors (KNN)
  • Decision Trees
  • Random Forest Regressor
  • Gradient Boosting Regressor

While simpler models like Linear Regression and KNN provided quick results, they often failed to capture the complex, nonlinear relationships between weather variables and solar production — especially under edge conditions like partial cloud cover or winter sun angles.

In contrast, both Random Forest and Gradient Boosting consistently delivered the best results during cross-validation. Each has its strengths:

  • Random Forest is highly robust to noise and overfitting, and handles diverse input features well without much parameter tuning.
  • Gradient Boosting often achieves slightly higher accuracy by learning from its own mistakes through an iterative boosting process, especially when fine-tuned.

Rather than committing to a single approach, the current version of my system evaluates both models every day and chooses the one with the better performance based on:

  • R² Score – how well the model explains the variance in actual energy production
  • Mean Absolute Error (MAE) – how far off the predictions are, on average, in kilowatt-hours

The evaluation and selection logic is handled automatically in Python using cross_val_score and basic comparison logic. Here’s an example:


    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
    from sklearn.model_selection import cross_val_score
    import numpy as np
    
    # Define both candidate models
    rf = RandomForestRegressor(n_estimators=1000, random_state=42)
    gb = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05, random_state=42)
    
    # Cross-validate each one (scikit-learn returns negated MAE, so flip the sign)
    rf_scores = cross_val_score(rf, X, y, scoring='neg_mean_absolute_error', cv=5)
    gb_scores = cross_val_score(gb, X, y, scoring='neg_mean_absolute_error', cv=5)
    
    rf_mae = -np.mean(rf_scores)
    gb_mae = -np.mean(gb_scores)
    
    # Choose the model with the lower error, then fit it on the full dataset
    best_model = rf if rf_mae < gb_mae else gb
    best_model.fit(X, y)
    

This flexible model selection mechanism allows my system to adapt to changing weather conditions, seasonal shifts, or unexpected anomalies. By always choosing the better of the two, the system uses the stronger predictor each day with minimal retraining overhead.

4. Training: Teaching the Model Using Past Data

Once the most suitable algorithm was selected (in this case, Random Forest Regressor), the model was trained on the complete historical dataset. The goal of this training phase is for the model to learn the underlying relationships between daily weather patterns and solar energy production over time.

To achieve the best performance, a structured training workflow was followed:

  • Data Split: The dataset was divided into training and test sets to evaluate the model’s performance on unseen data.
  • Cross-Validation: A k-fold cross-validation approach was used, where the dataset was split into multiple folds (usually 5), allowing the model to be trained and validated repeatedly across different segments of data. This helps avoid overfitting and ensures more reliable performance estimates.
  • Hyperparameter Optimization: Using GridSearchCV from scikit-learn, a systematic search was performed across various combinations of model parameters to find the optimal settings.

After testing a wide range of parameter combinations, the following configuration was found to provide the best balance between accuracy and generalization:

  • n_estimators = 1000 – the number of decision trees in the forest
  • max_depth = 20 – limits how deep each tree can grow, preventing overfitting
  • min_samples_leaf = 4 – the minimum number of data points required in a leaf node, which ensures that trees don't become overly specific to rare patterns
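The parameter search can be sketched with GridSearchCV as below. To keep the example fast and self-contained it uses a tiny synthetic dataset and a reduced grid; the real search ran over the full dataset with values up to n_estimators = 1000 and max_depth = 20.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Tiny synthetic stand-in for the real weather/energy dataset
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(60, 6))
y = 30 * X[:, 5] + rng.normal(0, 2, size=60)  # radiation-driven toy target

# Reduced grid for illustration only
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 20],
    "min_samples_leaf": [2, 4],
}
search = GridSearchCV(RandomForestRegressor(random_state=42),
                      param_grid, cv=3, scoring="neg_mean_absolute_error")
search.fit(X, y)
print(search.best_params_)
```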

With this setup, the model was trained on more than 1,300 days of carefully prepared energy and weather data. The resulting performance, as measured on the test set as of 19 April 2025, was:

  • R² Score: 0.80 – The model explains roughly 80% of the day-to-day variance in actual energy production. An R² of 1.0 would be a perfect prediction, so 0.80 is a solid result for a real-world forecast driven by weather inputs alone.
  • Mean Absolute Error (MAE): 4.47 kWh – On average, the model's predictions deviate from the actual daily energy production by 4.47 kWh. Given a daily range that often spans 10 to 40 kWh, this is a reasonable margin of error.

The combination of a solid R² score and a low MAE confirms that the model has captured much of the relationship between environmental conditions and solar performance. It is now capable of making useful day-ahead forecasts based on weather input alone.

5. Forecast Generation: How Today’s Prediction Is Made

Once the machine learning model has been trained on historical data, it can be used for daily forecasting. Each morning, the system automatically pulls up-to-date weather forecast data from the Open-Meteo API. This data includes the predicted weather parameters for the current and upcoming days, formatted as a JSON object.

Here’s an example of the forecast input for a single day:


          {
            "date": "2025-04-19",
            "temp_max": 20.1,
            "temp_min": 10.5,
            "cloud_cover": 0.3,
            "radiation": 6.2,
            "wind_speed": 5.7,
            "precipitation": 0.0
          }
          

The Python script reads this JSON object and transforms it into a feature vector that matches the structure of the model’s training input. For example:


          X = [20.1, 10.5, 0.3, 0.0, 5.7, 6.2]
          

This feature vector is then passed to the trained RandomForestRegressor, which analyzes the input and returns a predicted energy yield for the day — in this example, 51.29 kWh.

Once the prediction is made, it is saved in a structured output file named forecast.json. This file serves as the machine-readable record of daily and future forecasts, used by dashboards and visualization tools.

Example output:


          {
            "production": [
              { "date": "2025-04-19", "forecast": 51.29 },
              { "date": "2025-04-20", "forecast": 48.73 }
            ]
          }
          
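Producing that file is a short serialization step. The sketch below assumes the per-day predictions have already been collected from the model; the helper name write_forecast and the placeholder numbers are illustrative, standing in for real best_model.predict(...) output:

```python
import json

def write_forecast(predictions: dict[str, float]) -> str:
    """Serialize {date: predicted_kwh} into the forecast.json structure."""
    payload = {
        "production": [
            {"date": day, "forecast": round(kwh, 2)}
            for day, kwh in sorted(predictions.items())
        ]
    }
    return json.dumps(payload, indent=2)

# Placeholder predictions for illustration
doc = write_forecast({"2025-04-19": 51.29, "2025-04-20": 48.73})
print(doc)
```

In the real pipeline the string would be written to forecast.json instead of printed.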

This process runs automatically on a schedule (e.g., every night or early morning), ensuring that the forecast is always based on the latest available weather data. The system can also be configured to update more frequently throughout the day if new weather data becomes available.

By combining high-resolution weather forecasts with a trained prediction model, the system provides a near real-time estimation of expected solar production, supporting everything from household energy planning to battery charging strategies and EV scheduling.

6. Comparing Forecast to Reality

A forecast is only as good as its accuracy — which is why comparing predicted values to actual outcomes is a critical part of the system. At the end of each day, the system receives a new upload of measured solar production from the inverter. This data is stored in a structured JSON file named fronius.json, which contains the daily energy totals recorded by the inverter.

Here’s an example of the structure:


          {
            "daily": {
              "2025-04-19": { "total": 23.68 },
              "2025-04-18": { "total": 26.10 }
            }
          }
          

This file is automatically generated by the inverter’s data logging system and uploaded daily, ensuring that actual energy values are available soon after sunset. These values can then be matched against the corresponding forecast stored in forecast.json.

By comparing the forecasted and actual production, the dashboard can display how accurate the prediction was. For example, if the forecast for April 19 was 51.29 kWh and the actual production turned out to be 23.68 kWh, then:


          Accuracy = 23.68 / 51.29 ≈ 0.462 → 46% of predicted energy was produced
          
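The comparison itself is a one-liner per day. A minimal sketch, using the forecast.json and fronius.json structures shown above (loaded here as inline dictionaries for illustration):

```python
forecast = {"production": [{"date": "2025-04-19", "forecast": 51.29},
                           {"date": "2025-04-20", "forecast": 48.73}]}
fronius = {"daily": {"2025-04-19": {"total": 23.68},
                     "2025-04-18": {"total": 26.10}}}

# Match each forecast to the measured total for the same date
for entry in forecast["production"]:
    day = entry["date"]
    actual = fronius["daily"].get(day)
    if actual is None:
        continue  # no measurement yet (e.g. a future date)
    accuracy = actual["total"] / entry["forecast"]
    print(f"{day}: {accuracy:.0%} of predicted energy was produced")
```

For April 19 this reproduces the 46% figure from the example above.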

This deviation could be due to unforeseen changes in weather conditions, such as unexpected cloud cover or haze, which were not captured by the weather forecast. Such discrepancies help users:

  • Gauge the reliability of short-term forecasts
  • Diagnose model underperformance on specific weather patterns
  • Improve future forecasting through model retraining with updated data

On the dashboard, these comparisons are visualized to give users an immediate understanding of how well the AI model performed. Trends over time — such as overestimation during stormy weeks or underestimation during clear spells — can also be monitored to fine-tune model performance and expectation.

Ultimately, this feedback loop closes the circle: the model is trained on historical data, used to forecast future energy production, and then evaluated daily against real-world performance — creating a system that is both data-driven and self-correcting over time.

7. Visualization and Live Updates

Forecasts and actual energy production data are brought to life through an interactive web dashboard that provides a clear and intuitive overview of solar performance. The interface is designed to give users a real-time sense of how their solar installation is performing — both in absolute terms and relative to predictions.

The dashboard displays data using line and bar charts for both short- and long-term perspectives:

  • Line Charts: These show daily forecasts alongside actual production, allowing users to visually compare predicted vs. measured output over time (e.g., the past 7 or 30 days).
  • Bar Charts: Used for summarizing monthly totals, enabling a quick overview of seasonal patterns and longer-term trends.

To enhance the sense of immediacy, the dashboard is configured to refresh automatically every 15 minutes. This ensures that any newly uploaded data — whether from the weather forecast or the inverter’s measurements — is reflected almost instantly. Behind the scenes, the page reads the latest JSON files (such as forecast.json and fronius.json) and redraws the graphs without requiring a manual reload.

In addition to the charts, the dashboard features progress bars that provide a visual snapshot of the day’s current performance. These show how much energy has been produced so far relative to the forecast, updating live throughout the day. For example:


          Forecast: 38.0 kWh   |   Current: 19.3 kWh   →   50.8% complete
          
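The percentage behind the progress bar is simple arithmetic. A sketch in Python (the real dashboard computes this client-side; the function name is an assumption):

```python
def progress_percent(current_kwh: float, forecast_kwh: float) -> float:
    """Share of the day's forecast already produced, capped at 100%."""
    if forecast_kwh <= 0:
        return 0.0
    return min(100.0, 100.0 * current_kwh / forecast_kwh)

print(f"{progress_percent(19.3, 38.0):.1f}% complete")  # → 50.8% complete
```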

This simple but effective visualization gives users a real-time sense of whether they are on track to meet the day’s solar potential — especially useful for timing household energy use, charging electric vehicles, or evaluating the effects of sudden weather changes.

All charts and indicators are designed to be mobile-friendly and kiosk-ready, making them ideal for use on dedicated home information screens, wall-mounted tablets, or full-screen displays on a smart TV or Raspberry Pi.

In short, the dashboard is more than just a data viewer — it’s a dynamic, real-time energy companion that makes the performance of our solar installation easy to understand and act upon.

8. Why This Is Better Than Pure GTI Models

Traditional solar forecasting often relies on GTI models (Global Tilted Irradiance), which estimate energy yield based on the theoretical amount of solar radiation striking a tilted panel surface. These models use calculations derived from sun angles, panel orientation, and location-specific irradiance data, and are excellent for predicting ideal-case scenarios under clear skies.

However, GTI-based models have a major limitation: they assume perfect conditions. While they can calculate how much energy should be generated, they do not account for the actual behavior of our system or its real-world environment. Specifically, GTI models tend to overlook:

  • Shading effects from trees, buildings, or chimneys, which reduce output regardless of theoretical irradiance.
  • Temperature derating, where panel efficiency drops during high heat — especially relevant in sunny but hot climates.
  • Inverter clipping, which limits output once DC input exceeds the inverter's rated AC capacity.
  • Dust, dirt, or snow on panels, which may significantly reduce real production even when GTI is high.

This is where the AI model comes in. Trained on actual, historical production data from our own installation, the AI learns the complex, non-linear relationships between weather, GTI, and real energy output — including all the local quirks and physical constraints that affect our system.

Over time, the AI adapts to the unique characteristics of our environment, taking into account how our panels respond to varying levels of cloud cover, wind, humidity, and seasonal angles. This allows it to make highly personalized, data-driven forecasts that reflect how our specific setup behaves — not how an idealized model predicts it should.

However, GTI data still plays a crucial role. It offers a stable and consistent reference baseline for incoming solar energy, helping the AI anchor its understanding of what's theoretically possible. GTI-based forecasts also remain valuable for estimating production in locations without existing performance data, or for long-term projections.

That’s why the best solution isn’t AI or GTI — it’s a hybrid approach that combines both:

  • GTI provides the physical model of expected irradiance based on atmospheric and solar geometry.
  • AI adjusts the forecast using actual past performance, weather effects, system constraints, and historical behavior.

This synergy allows us to benefit from both the physical rigor of the GTI model and the learned, real-world behavior captured by the AI, resulting in forecasts that are not just theoretically correct, but truly reflective of what our solar system will deliver under today’s actual conditions.

9. Limitations and Future Work

While the current AI-driven forecasting system provides impressive results and significantly outperforms most rule-based or GTI-only approaches, it remains a work in progress. Like any machine learning application, its accuracy and adaptability can be enhanced through further refinement and expansion.

Several opportunities for improvement have already been identified:

  • Hourly Forecasting: At present, the model provides only daily production estimates. Introducing hourly resolution would allow users to anticipate generation peaks and troughs throughout the day — ideal for optimizing battery charging, EV scheduling, or dynamic consumption strategies.
  • More Frequent Retraining: The model could be retrained weekly or even daily using newly collected inverter data. This would help it stay aligned with recent system behavior, seasonal shifts, hardware degradation, or cleaning events.
  • Satellite-Based Cloud Detection: Integrating satellite imagery or cloud cover maps could significantly improve short-term accuracy, particularly in variable weather conditions. These sources can detect fine-grained shading patterns and distinguish between high and low cloud layers that affect irradiance differently.
  • Advanced Feature Enrichment: Adding new features such as humidity, dew point, atmospheric pressure, solar zenith angle, and even snow cover could allow the model to react more intelligently to environmental factors that affect production.
  • Model Ensemble Techniques: Combining multiple forecasting models (e.g., Random Forest + Gradient Boosting + Neural Networks) in an ensemble could yield more resilient and stable forecasts across all weather types.
  • Confidence Estimation: Future versions of the model may also produce a confidence interval alongside each prediction (e.g., “Expected: 32.4 kWh ± 2.1”), providing users with a sense of how uncertain a forecast is on any given day.

Despite these known limitations, the model has already demonstrated high reliability, adaptability, and accuracy, routinely outperforming static models, rules-of-thumb, and purely GTI-based estimates. Its ability to learn from real-world data and continuously refine itself makes it a powerful tool for both casual monitoring and serious energy planning.

As development continues, the system will evolve from a reactive forecasting tool into a truly intelligent assistant — capable of anticipating, adapting, and helping optimize energy decisions with minimal user intervention.