Rating Prediction Project
Data Processing, Model Training, and Prediction
This project tackles the challenge of predicting item ratings within a complex retail environment, where product shipments and a comprehensive master dataset are combined to form a rich foundation for analysis. By seamlessly integrating disparate data sources—spanning inventory details, dispatch records, and item attributes—this system provides a unified view of each product’s lifecycle.
Once the data is properly prepared, cleaned, and consolidated, the project leverages Machine Learning models to classify or predict the final ratings of these items. This predictive capacity offers insights into product performance, helping stakeholders make strategic decisions about inventory distribution, promotional activities, and overall product line management.
The workflow is structured into three core scripts:
data_preparation.py
– Gathers, cleans, and merges raw data from multiple sources.data_processing.py
– Focuses on feature selection, splitting data into training/testing sets, and building preprocessing pipelines.model.py
– Utilizes different algorithms (e.g., Neural Networks, Decision Trees, KNN) to train predictive models and generate item rating predictions.
Through this integrated process, the project aims to streamline data handling while uncovering patterns that guide business actions more effectively.
1. data_preparation.py
# data_preparation.py
# Libraries and Warnings
from sqlalchemy import create_engine
import pandas as pd
import numpy as np
import warnings
from unidecode import unidecode
warnings.filterwarnings("ignore")
# SQL Class for Database Queries
class SQL():
def __init__(self, cnn_str) -> None:
self.engine = create_engine(cnn_str)
def tabla_des(self):
"""
Query shipping (despachos) data from the database.
"""
query = """
SELECT *
FROM shipments_table
WHERE ...;
"""
return pd.read_sql(query, self.engine)
def process_table(df):
"""
Cleans and standardizes columns like 'precio', 'color', 'estetica', etc.
Converts string prices to numeric, normalizes text, etc.
"""
df['precio'] = df['precio'].str.replace('$', '', regex=True)
df['precio'] = df['precio'].str.replace(',', '', regex=True).astype(float)
# Additional transformations: unify text, categories, colors, silhouettes...
return df
def load_data(mes_exh=False, columns_to_remove=None):
"""
Loads the items master data from a CSV, applies 'process_table',
and optionally adds a 'fecha_exhibicion' column.
Removes specific columns if 'columns_to_remove' is given.
"""
items_df = pd.read_csv(r"items_master.csv", sep='\t', encoding='UTF-8')
items_df = process_table(items_df)
if mes_exh:
items_df['fecha_exhibicion'] = mes_exh
if columns_to_remove:
for col in columns_to_remove:
if col in items_df.columns:
items_df.drop(col, axis=1, inplace=True)
return items_df
def main():
"""
1) Creates a database connection and fetches shipping data.
2) Loads and processes the master items data.
3) Merges both DataFrames on 'item' and sorts by 'fecha_exhibicion'.
4) Adds item rating info from an Excel file, then returns the final DataFrame.
"""
# Example connection string (replace with your actual credentials)
cnn_str_dw = 'postgresql://username:password@host:port/database'
sql_conn = SQL(cnn_str_dw)
# Get shipping data
shipments_df = sql_conn.tabla_des()
# Master items
items_df = load_data()
# Merge shipping + items
merged_df = items_df.merge(
shipments_df,
on='item',
how='inner'
).sort_values(by=['item', 'fecha_exhibicion'])
# Read rating data from an Excel file
rating_df = pd.read_excel("item_rating.xlsx").iloc[:, :2]
rating_df.columns = ['item', 'calificacion']
# Merge rating info
final_df = merged_df.merge(rating_df, on='item', how='inner')
# Convert 'fecha_exhibicion' to month
final_df['fecha_exhibicion'] = pd.to_datetime(final_df['fecha_exhibicion'])
final_df = final_df[final_df['fecha_exhibicion'] > '2020-01-01']
final_df['fecha_exhibicion'] = final_df['fecha_exhibicion'].dt.month.astype(object)
# Set 'item' as index
final_df.set_index('item', inplace=True)
return final_df
Explanation:
SQL
runs queries to fetch shipping data.process_table
normalizes text and cleans prices.load_data
loads a CSV of item master data and optionally applies extra transformations.main
merges the processed items with shipping info and their respective ratings.
2. data_processing.py
# data_processing.py
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from data_preparation import main
class DataProcessor:
@staticmethod
def data_final(gender=None, category=None, rs=10):
"""
1) Calls 'main()' from data_preparation to get the merged DataFrame.
2) Optionally filters by 'gender' and/or 'category'.
3) Splits data into training and test sets.
4) Returns X_train, X_test, y_train, y_test, plus the complete X and y.
"""
df = main()
if gender:
df = df[df['genero'] == gender]
if category:
df = df[df['categoria'] == category]
# Features & target
X = df.drop(['calificacion', 'genero', 'categoria'], axis=1)
y = df['calificacion']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=rs
)
return X_train, X_test, y_train, y_test, X, y
@staticmethod
def create_pipeline(cat_encoder, model, num_encoder):
"""
1) Uses ColumnTransformer to encode categorical & scale numerical columns.
2) Applies SelectKBest for feature selection.
3) Finally, uses the chosen 'model' for classification.
"""
pipeline = Pipeline([
('preprocessing', ColumnTransformer(
transformers=[
('cat_encoding', cat_encoder, make_column_selector(dtype_include='object')),
('num_scaling', num_encoder, make_column_selector(dtype_include='number'))
],
remainder='passthrough'
)),
('feature_selection', SelectKBest(score_func=chi2)),
('classification', model)
])
return pipeline
@staticmethod
def plot_confusion_matrix(cm, target_names, title='Confusion matrix', cmap=None, normalize=False):
"""
Plots a confusion matrix, optionally normalized, showing accuracy & misclassification rates.
Useful to interpret model performance on each class.
"""
# Full code for plotting the matrix, setting labels, titles, etc.
pass
Explanation:
data_final
fetches the prepared data fromdata_preparation.py
, optionally filters it, and splits it into training and test sets.create_pipeline
builds a pipeline that includes: categorical encoding, numerical scaling, feature selection (usingchi2
), and a final classification step.plot_confusion_matrix
is a helper function to visualize the matrix of errors.
3. model.py
# model.py
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
import pandas as pd
import time
from data_processing import DataProcessor
class ModelTrainer:
def __init__(self, gender, exhibition_month):
"""
Initializes three ML classifiers:
1) Neural Network (MLP)
2) Decision Tree
3) K-Nearest Neighbors
Receives 'gender' and 'exhibition_month' for data filtering & labeling.
"""
self.classifiers = [
MLPClassifier(max_iter=1000, random_state=5),
DecisionTreeClassifier(random_state=5),
KNeighborsClassifier(n_neighbors=40)
]
self.gender = gender
self.exhibition_month = exhibition_month
self.dict_pred = {}
def train_models(self, X, y, item_list=None, test_mode=False):
"""
1) Creates pipelines for each classifier using DataProcessor.create_pipeline.
2) Trains each model on X, y.
3) Predicts on a subset of items if 'item_list' is provided.
4) Saves results to CSV.
"""
for clf in self.classifiers:
pipeline = DataProcessor.create_pipeline(
cat_encoder=OneHotEncoder(),
model=clf,
num_encoder=MinMaxScaler()
)
pipeline.fit(X, y)
if item_list:
# This function would load or filter the items to predict
X_items = self.load_items_for_prediction(item_list, test_mode)
self.dict_pred[clf.__class__.__name__] = pipeline.predict(X_items)
# Save dictionary to CSV or manipulate further
# ...
return self.dict_pred
def load_items_for_prediction(self, items, test_mode):
"""
If 'test_mode' is True, it reads from a testing sheet.
Otherwise, it filters the main data by the given item list.
"""
# Implementation that fetches or filters the relevant items
# ...
return X_selected
# Example usage
if __name__ == "__main__":
X_train, X_test, y_train, y_test, X, y = DataProcessor.data_final(gender='hombre', category='camisas')
trainer = ModelTrainer(gender='hombre', exhibition_month=5)
trainer.train_models(X_train, y_train, item_list=['itemA','itemB'])
Explanation:
ModelTrainer
defines and manages the three classifiers: Neural Network, Decision Tree, and K-Nearest Neighbors.train_models
uses pipelines (built inDataProcessor
) to train each classifier and optionally predict on a specific list of items.load_items_for_prediction
is a helper to fetch or filter data for new items you want to predict.
🚀 How to Run the Project
To get started with this rating prediction project, follow these steps:
-
Install Dependencies:
Use the requirements file to install everything you need:pip install -r requirements.txt
-
Execute Each Script in Order:
python data_preparation.py
– Prepares and merges dataset information.python data_processing.py
– Splits data into training and testing sets, applies pipelines.python model.py
– Trains the chosen models and generates predictions.
-
Adjust Project Parameters:
Withinmodel.py
, you can customize options (e.g., gender, category, or exhibition_month) to meet your data and modeling needs.
📦 Project Dependencies
This project relies on the following key libraries and packages:
scikit-learn
– Machine Learning pipelines and algorithmspandas
– Data manipulation and analysisnumpy
– Numerical operationssqlalchemy
– SQL database interactionopenpyxl
– Excel file readingmatplotlib
– Data visualization (optional)
📊 Model Evaluation with a Confusion Matrix
I used a confusion matrix to assess each model’s performance. By comparing predictions to actual labels, you can see exactly where and how often the model is correct or incorrect. This reveals which classes need more attention or if certain patterns are being misclassified.