Rating Prediction Project

Data Processing, Model Training, and Prediction

This project tackles the challenge of predicting item ratings in a retail environment by combining product shipment records with a master item dataset. Integrating these sources (inventory details, dispatch records, and item attributes) gives a unified view of each product's lifecycle.

Once the data is prepared, cleaned, and consolidated, the project trains Machine Learning models to predict the final ratings of these items. These predictions offer insight into product performance, helping stakeholders make decisions about inventory distribution, promotional activity, and product line management.

The workflow is structured into three core scripts:

  1. data_preparation.py – Gathers, cleans, and merges raw data from multiple sources.
  2. data_processing.py – Focuses on feature selection, splitting data into training/testing sets, and building preprocessing pipelines.
  3. model.py – Utilizes different algorithms (e.g., Neural Networks, Decision Trees, KNN) to train predictive models and generate item rating predictions.

Through this integrated process, the project aims to streamline data handling while uncovering patterns that guide business actions more effectively.

1. data_preparation.py


# data_preparation.py

# Libraries and Warnings
from sqlalchemy import create_engine
import pandas as pd
import numpy as np
import warnings
from unidecode import unidecode

warnings.filterwarnings("ignore")

# SQL Class for Database Queries
class SQL():
    def __init__(self, cnn_str) -> None:
        self.engine = create_engine(cnn_str)

    def tabla_des(self):
        """
        Query shipping (despachos) data from the database.
        """
        query = """
            SELECT * 
            FROM shipments_table
            WHERE ...;
        """
        return pd.read_sql(query, self.engine)

def process_table(df):
    """
    Cleans and standardizes columns like 'precio', 'color', 'estetica', etc.
    Converts string prices to numeric, normalizes text, etc.
    """
    # '$' is a regex metacharacter, so use literal replacement (regex=False)
    df['precio'] = df['precio'].str.replace('$', '', regex=False)
    df['precio'] = df['precio'].str.replace(',', '', regex=False).astype(float)
    # Additional transformations: unify text, categories, colors, silhouettes...
    return df

def load_data(mes_exh=False, columns_to_remove=None):
    """
    Loads the items master data from a CSV, applies 'process_table',
    and optionally adds a 'fecha_exhibicion' column. 
    Removes specific columns if 'columns_to_remove' is given.
    """
    items_df = pd.read_csv(r"items_master.csv", sep='\t', encoding='UTF-8')
    items_df = process_table(items_df)

    if mes_exh:
        items_df['fecha_exhibicion'] = mes_exh

    if columns_to_remove:
        for col in columns_to_remove:
            if col in items_df.columns:
                items_df.drop(col, axis=1, inplace=True)
    return items_df

def main():
    """
    1) Creates a database connection and fetches shipping data.
    2) Loads and processes the master items data.
    3) Merges both DataFrames on 'item' and sorts by 'fecha_exhibicion'.
    4) Adds item rating info from an Excel file, then returns the final DataFrame.
    """
    # Example connection string (replace with your actual credentials)
    cnn_str_dw = 'postgresql://username:password@host:port/database'
    sql_conn = SQL(cnn_str_dw)
    
    # Get shipping data
    shipments_df = sql_conn.tabla_des()

    # Master items
    items_df = load_data()
    
    # Merge shipping + items
    merged_df = items_df.merge(
        shipments_df, 
        on='item', 
        how='inner'
    ).sort_values(by=['item', 'fecha_exhibicion'])

    # Read rating data from an Excel file
    rating_df = pd.read_excel("item_rating.xlsx").iloc[:, :2]
    rating_df.columns = ['item', 'calificacion']

    # Merge rating info
    final_df = merged_df.merge(rating_df, on='item', how='inner')

    # Keep exhibitions after 2020-01-01, then reduce 'fecha_exhibicion' to its month
    final_df['fecha_exhibicion'] = pd.to_datetime(final_df['fecha_exhibicion'])
    final_df = final_df[final_df['fecha_exhibicion'] > '2020-01-01']
    final_df['fecha_exhibicion'] = final_df['fecha_exhibicion'].dt.month.astype(object)

    # Set 'item' as index
    final_df.set_index('item', inplace=True)
    
    return final_df


if __name__ == "__main__":
    # Allow a standalone run; the other scripts import main() directly
    final_df = main()
    print(final_df.head())

Explanation:

  • SQL runs queries to fetch shipping data.
  • process_table normalizes text and cleans prices.
  • load_data loads a CSV of item master data and optionally applies extra transformations.
  • main merges the processed items with shipping info and their respective ratings.
The result is a fully prepared DataFrame that other scripts can use.
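
To sanity-check this step on its own, a minimal sketch (assuming the CSV, the Excel rating file, and valid database credentials are in place) looks like this:

# Quick inspection of the prepared DataFrame; connection details are placeholders
from data_preparation import main

df = main()
print(df.shape)
print(df['calificacion'].value_counts())  # class balance of the rating target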

2. data_processing.py


# data_processing.py

import numpy as np
import matplotlib.pyplot as plt  # used by plot_confusion_matrix

from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from data_preparation import main

class DataProcessor:
    @staticmethod
    def data_final(gender=None, category=None, rs=10):
        """
        1) Calls 'main()' from data_preparation to get the merged DataFrame.
        2) Optionally filters by 'gender' and/or 'category'.
        3) Splits data into training and test sets.
        4) Returns X_train, X_test, y_train, y_test, plus the complete X and y.
        """
        df = main()

        if gender:
            df = df[df['genero'] == gender]
        if category:
            df = df[df['categoria'] == category]

        # Features & target
        X = df.drop(['calificacion', 'genero', 'categoria'], axis=1)
        y = df['calificacion']

        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=rs
        )
        return X_train, X_test, y_train, y_test, X, y

    @staticmethod
    def create_pipeline(cat_encoder, model, num_encoder):
        """
        1) Uses ColumnTransformer to encode categorical & scale numerical columns.
        2) Applies SelectKBest for feature selection.
        3) Finally, uses the chosen 'model' for classification.
        """
        pipeline = Pipeline([
            ('preprocessing', ColumnTransformer(
                transformers=[
                    ('cat_encoding', cat_encoder, make_column_selector(dtype_include='object')),
                    ('num_scaling', num_encoder, make_column_selector(dtype_include='number'))
                ],
                remainder='passthrough'
            )),
            # chi2 requires non-negative inputs; MinMaxScaler and OneHotEncoder keep features in [0, 1]
            ('feature_selection', SelectKBest(score_func=chi2)),
            ('classification', model)
        ])
        return pipeline

    @staticmethod
    def plot_confusion_matrix(cm, target_names, title='Confusion matrix', cmap=None, normalize=False):
        """Plot a confusion matrix, optionally row-normalized, to show per-class hits and errors."""
        if normalize:
            cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        plt.imshow(cm, interpolation='nearest', cmap=cmap or plt.cm.Blues)
        plt.colorbar()
        ticks = np.arange(len(target_names))
        plt.xticks(ticks, target_names, rotation=45)
        plt.yticks(ticks, target_names)
        plt.title(title)
        plt.ylabel('True label')
        plt.xlabel('Predicted label')
        plt.show()
    

Explanation:

  • data_final fetches the prepared data from data_preparation.py, optionally filters it, and splits it into training and test sets.
  • create_pipeline builds a pipeline that includes: categorical encoding, numerical scaling, feature selection (using chi2), and a final classification step.
  • plot_confusion_matrix is a helper function to visualize the matrix of errors.
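
For example, a hedged end-to-end run (the encoder, scaler, and model choices below are illustrative, not fixed by the project) composes the two helpers like this:

# Illustrative composition of data_final and create_pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from data_processing import DataProcessor

X_train, X_test, y_train, y_test, X, y = DataProcessor.data_final(rs=10)
pipe = DataProcessor.create_pipeline(
    cat_encoder=OneHotEncoder(handle_unknown='ignore'),
    model=DecisionTreeClassifier(random_state=5),
    num_encoder=MinMaxScaler()
)
pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")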

3. model.py


# model.py

from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
import pandas as pd
import time

from data_processing import DataProcessor

class ModelTrainer:
    def __init__(self, gender, exhibition_month):
        """
        Initializes three ML classifiers: 
          1) Neural Network (MLP)
          2) Decision Tree
          3) K-Nearest Neighbors
        Receives 'gender' and 'exhibition_month' for data filtering & labeling.
        """
        self.classifiers = [
            MLPClassifier(max_iter=1000, random_state=5),
            DecisionTreeClassifier(random_state=5),
            KNeighborsClassifier(n_neighbors=40)
        ]
        self.gender = gender
        self.exhibition_month = exhibition_month
        self.dict_pred = {}

    def train_models(self, X, y, item_list=None, test_mode=False):
        """
        1) Creates pipelines for each classifier using DataProcessor.create_pipeline.
        2) Trains each model on X, y.
        3) Predicts on a subset of items if 'item_list' is provided.
        4) Saves results to CSV.
        """
        for clf in self.classifiers:
            pipeline = DataProcessor.create_pipeline(
                # handle_unknown='ignore' prevents errors on categories unseen during training
                cat_encoder=OneHotEncoder(handle_unknown='ignore'),
                model=clf,
                num_encoder=MinMaxScaler()
            )
            pipeline.fit(X, y)

            if item_list:
                # This function would load or filter the items to predict
                X_items = self.load_items_for_prediction(item_list, test_mode)
                self.dict_pred[clf.__class__.__name__] = pipeline.predict(X_items)

        # Persist the predictions, one column per classifier
        # (the filename pattern below is illustrative, not fixed by the project)
        if self.dict_pred:
            pd.DataFrame(self.dict_pred).to_csv(
                f"predicciones_{self.gender}_{self.exhibition_month}.csv", index=False
            )
        return self.dict_pred

    def load_items_for_prediction(self, items, test_mode):
        """
        If 'test_mode' is True, it reads from a testing sheet.
        Otherwise, it filters the main data by the given item list.
        """
        # Implementation that fetches or filters the relevant items
        # ...
        return X_selected

# Example usage
if __name__ == "__main__":
    X_train, X_test, y_train, y_test, X, y = DataProcessor.data_final(gender='hombre', category='camisas')
    trainer = ModelTrainer(gender='hombre', exhibition_month=5)
    trainer.train_models(X_train, y_train, item_list=['itemA','itemB'])
    

Explanation:

  • ModelTrainer defines and manages the three classifiers: Neural Network, Decision Tree, and K-Nearest Neighbors.
  • train_models uses pipelines (built in DataProcessor) to train each classifier and optionally predict on a specific list of items.
  • load_items_for_prediction is a helper to fetch or filter data for new items you want to predict.
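
To compare the three classifiers on the held-out split before predicting new items, a short loop (a sketch that reuses the pipeline factory, not part of ModelTrainer itself) can score each one:

# Score each classifier on the test split; purely illustrative
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from data_processing import DataProcessor

X_train, X_test, y_train, y_test, X, y = DataProcessor.data_final(gender='hombre', category='camisas')
for clf in [MLPClassifier(max_iter=1000, random_state=5),
            DecisionTreeClassifier(random_state=5),
            KNeighborsClassifier(n_neighbors=40)]:
    pipe = DataProcessor.create_pipeline(OneHotEncoder(handle_unknown='ignore'), clf, MinMaxScaler())
    pipe.fit(X_train, y_train)
    print(clf.__class__.__name__, round(pipe.score(X_test, y_test), 3))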

🚀 How to Run the Project

To get started with this rating prediction project, follow these steps:

  1. Install Dependencies:
    Use the requirements file to install everything you need:
    pip install -r requirements.txt
  2. Run the Pipeline:
    data_preparation.py and data_processing.py expose the functions that model.py imports, so running model.py drives the whole flow:
    1. python data_preparation.py – (Optional) prepares and merges the dataset on its own.
    2. python model.py – Splits the data via DataProcessor, trains the chosen models, and generates predictions.
  3. Adjust Project Parameters:
    Within model.py, you can customize options (e.g., gender, category, or exhibition_month) to meet your data and modeling needs.

📦 Project Dependencies

This project relies on the following key libraries and packages:

  • scikit-learn – Machine Learning pipelines and algorithms
  • pandas – Data manipulation and analysis
  • numpy – Numerical operations
  • sqlalchemy – SQL database interaction (plus a PostgreSQL driver such as psycopg2 for the postgresql:// connection string)
  • openpyxl – Excel file reading
  • unidecode – Normalizing accented text during cleaning
  • matplotlib – Plotting the confusion matrix
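
A minimal requirements.txt consistent with this list might look like the following (unpinned; pin versions as your environment requires):

scikit-learn
pandas
numpy
sqlalchemy
psycopg2-binary
openpyxl
unidecode
matplotlib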

📊 Model Evaluation with a Confusion Matrix

I used a confusion matrix to assess each model's performance. Comparing predictions against the actual labels shows exactly where, and how often, a model is right or wrong, revealing which classes need more attention and which patterns are being misclassified.
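
For instance, assuming a fitted pipeline pipe and the test split from data_final, the matrix can be computed with scikit-learn and rendered with the helper from data_processing.py:

# Build and plot a confusion matrix for a fitted pipeline (illustrative)
from sklearn.metrics import confusion_matrix
from data_processing import DataProcessor

y_pred = pipe.predict(X_test)
labels = sorted(y_test.unique())
cm = confusion_matrix(y_test, y_pred, labels=labels)
DataProcessor.plot_confusion_matrix(cm, target_names=labels, normalize=True)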