The hardest part of production ML is not the model. It is everything around it.
At CloudFountain, our sales team was spending significant time manually evaluating leads. They had a spreadsheet with scoring rules: industry type, company size, engagement level, source channel. It worked, but it was slow, inconsistent across team members, and could not adapt as patterns changed. The business question was straightforward: can we predict which leads are most likely to convert, and which are likely to get approved in the placement process?
This is the story of building that system using AWS SageMaker, from data preparation to production deployment, and what I would do differently the second time.
Phase 1: Data Preparation Was 70% of the Work
The raw data lived across three systems: our CRM (lead demographics and interaction history), the marketing platform (email engagement, ad attribution), and our internal database (application status, approval outcomes). Before any machine learning could happen, we needed a unified dataset.
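To make "unified dataset" concrete, here is a minimal sketch of the join, assuming all three systems can be keyed by a shared lead identifier. The function name `unify_leads` and field names such as `emails_opened`, `ad_source`, and `converted` are hypothetical and chosen for illustration; they are not the actual schema.

```python
def unify_leads(crm, marketing, outcomes):
    """Left-join marketing and outcome records onto CRM leads by lead id.

    Each argument is a dict keyed by lead id (an assumed shared key).
    Field names are illustrative placeholders, not a real schema.
    """
    unified = []
    for lead_id, lead in crm.items():
        row = {"lead_id": lead_id, **lead}

        # Keep missing engagement as None so it can be imputed deliberately
        # later, instead of being silently treated as zero.
        engagement = marketing.get(lead_id, {})
        row["emails_opened"] = engagement.get("emails_opened")
        row["ad_source"] = engagement.get("ad_source")

        # Leads without a recorded outcome get None and should be excluded
        # from training, since they carry no label yet.
        outcome = outcomes.get(lead_id)
        row["converted"] = outcome["converted"] if outcome else None

        unified.append(row)
    return unified
```

The CRM is treated as the source of truth for which leads exist, so a lead that never appears in the marketing platform still produces a row.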
SageMaker Data Wrangler was excellent for the initial exploration phase: understanding distributions, identifying missing values, and prototyping feature transformations. But for the production pipeline, we built the ETL as Python scripts that ran on a schedule, because flows built in Data Wrangler's visual interface do not version-control well and are difficult to reproduce across environments.
The raw fields from the CRM were not useful as-is. We engineered features like: days since last interaction, total number of touchpoints in the first 7 days, whether the lead came from a paid or organic channel, industry category mapped to historical conversion rates, and a binary flag for whether the lead responded to the first outreach within 48 hours. These engineered features improved model performance far more than any hyperparameter tuning.
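The features listed above can be sketched in plain Python as follows. This is an illustration under assumptions, not the production ETL: the function signature, the input field names (`created_at`, `source_channel`, `industry`), and the `PAID_CHANNELS` set are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical set of channel names treated as paid.
PAID_CHANNELS = {"paid_search", "paid_social", "display"}


def engineer_features(lead, interactions, first_outreach_at, first_response_at,
                      industry_rates, default_rate, as_of):
    """Derive the model features for one lead.

    lead: dict with 'created_at' (datetime), 'source_channel', 'industry'.
    interactions: list of datetimes, one per touchpoint.
    industry_rates: industry -> historical conversion rate. To avoid label
        leakage, these should be computed from the training period only.
    as_of: the datetime at which the lead is being scored.
    """
    created = lead["created_at"]
    last_touch = max(interactions) if interactions else created
    first_week_end = created + timedelta(days=7)

    # True only if a response exists and arrived within 48 hours of outreach.
    responded_fast = (
        first_response_at is not None
        and first_response_at - first_outreach_at <= timedelta(hours=48)
    )

    return {
        "days_since_last_interaction": (as_of - last_touch).days,
        "touchpoints_first_7_days": sum(
            1 for t in interactions if created <= t < first_week_end
        ),
        "is_paid_channel": int(lead["source_channel"] in PAID_CHANNELS),
        "industry_conversion_rate": industry_rates.get(lead["industry"], default_rate),
        "responded_within_48h": int(responded_fast),
    }
```

Passing `as_of` explicitly, rather than reading the current time inside the function, keeps the features reproducible when the same lead is rescored during retraining.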
Phase 2: Model Training with SageMaker Canvas and Custom Training
We started with SageMaker Canvas to get a baseline quickly. Canvas let the business stakeholders see preliminary results within hours, which was critical for buy-in. The no-code interface generated a classification model with reasonable accuracy, and more importantly, it showed the team that the data contained predictive signal.
For the production model, we moved to a SageMaker notebook running XGBoost. The reasons: we needed more control over feature preprocessing, we wanted to implement custom evaluation metrics (a weighted F1 score aligned with the business cost of false positives vs. false negatives), and we needed the training job to be reproducible in our CI/CD pipeline.
Accuracy was 87%, which sounds good, but accuracy alone said little about the two things the business cared about: not wasting sales time on leads that would not convert (precision) and not missing high-quality leads (recall). We optimized for an F1-style score weighted slightly toward precision, because a false positive meant hours of wasted sales effort while a false negative meant one missed lead among many. This decision came directly from sitting with the sales team and understanding their workflow.
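One common way to express "F1 with a bias toward precision" is the F-beta score with beta below 1. The sketch below assumes that formulation; the exact weighting we used is not reproduced here, and `beta=0.5` and the helper names are illustrative choices.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 = converted."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision, recall, beta=0.5):
    """F-beta score: beta < 1 favors precision, beta > 1 favors recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_threshold(y_true, scores, candidates, beta=0.5):
    """Pick the score cutoff from `candidates` that maximizes F-beta."""
    def score_at(threshold):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        return f_beta(*precision_recall(y_true, y_pred), beta=beta)
    return max(candidates, key=score_at)
```

With beta at 0.5, a model with perfect precision and 50% recall scores higher than one with 50% precision and perfect recall, which matches the cost asymmetry described above.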
Phase 3: Deployment and Integration
We deployed the model as a SageMaker real-time endpoint. Our Node.js backend calls this endpoint via the AWS SDK whenever a new lead enters the system. The response includes a score between 0 and 1 and the top contributing features, which we display in the CRM dashboard so sales representatives can see why a lead is ranked the way it is.
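The production call is made from Node.js, but the same request can be sketched in Python to stay consistent with the other examples here. The `invoke_endpoint` call and its parameters are part of the boto3 SageMaker runtime client; the JSON request and response shapes (`features`, `score`, `top_features`) are assumptions, since they depend entirely on the inference container behind the endpoint.

```python
import json


def score_lead(runtime_client, endpoint_name, features):
    """Send one lead's features to a real-time endpoint and parse the reply.

    runtime_client: a SageMaker runtime client, for example
        boto3.client("sagemaker-runtime"). It is passed in so the function
        can be exercised with a stub in unit tests.
    The payload and response field names below are hypothetical.
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps({"features": features}),
    )
    # The response body is a stream; read it fully before decoding.
    payload = json.loads(response["Body"].read())

    score = float(payload["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")

    return score, payload.get("top_features", [])
```

Validating the score range at the boundary means a misconfigured container fails loudly in the backend instead of showing a nonsensical rank in the CRM dashboard.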
The sales team did not trust the model until they could see why it scored leads the way it did. We used SageMaker Clarify to generate feature importance explanations for each prediction. When a sales rep sees "scored high because: responded within 24 hours, industry: fintech, 4+ touchpoints in first week," they trust and act on the score. Without explainability, the model would have been ignored within a month.
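Clarify's explanations are SHAP-based, so each prediction comes with one signed attribution value per feature. Turning those numbers into the short sentence a sales rep reads can be sketched as below. The function names, the label mapping, and the attribution values are made up for illustration and are not real model output.

```python
def top_reasons(attributions, labels, k=3):
    """Return readable labels for the k features that pushed the score up most.

    attributions: feature name -> signed contribution for one prediction.
    labels: feature name -> human-readable phrase (hypothetical mapping).
    """
    positive = [(name, value) for name, value in attributions.items() if value > 0]
    positive.sort(key=lambda item: item[1], reverse=True)
    # Fall back to the raw feature name when no label has been written yet.
    return [labels.get(name, name) for name, _ in positive[:k]]


def explain_high_score(attributions, labels, k=3):
    """Format the dashboard sentence for a lead that scored high."""
    return "scored high because: " + ", ".join(top_reasons(attributions, labels, k))


# Illustrative values only.
example_attributions = {
    "responded_within_48h": 0.21,
    "industry_conversion_rate": 0.12,
    "touchpoints_first_7_days": 0.08,
    "is_paid_channel": -0.05,
    "days_since_last_interaction": 0.01,
}
example_labels = {
    "responded_within_48h": "responded within 48 hours",
    "industry_conversion_rate": "industry with strong historical conversion",
    "touchpoints_first_7_days": "several touchpoints in first week",
}
print(explain_high_score(example_attributions, example_labels))
```

Only positive contributions are listed for a high score; a feature that pulled the score down is not a reason the lead ranked well, and showing it under "because" would confuse the reader.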
Phase 4: Monitoring and Drift
Models degrade. Lead quality patterns shift with market conditions, new marketing campaigns change the incoming lead profile, and seasonal variations affect conversion rates. We set up SageMaker Model Monitor to track data drift and prediction distribution changes weekly.
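To give a feel for what a drift check measures, here is a small sketch of the Population Stability Index (PSI), which compares how a feature or the prediction scores are distributed in the training data versus a recent window. This is a swapped-in illustration, not how Model Monitor computes drift internally; the bin edges and the `eps` floor are assumptions.

```python
import math


def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between a baseline and a recent sample.

    expected: baseline values (for example, training-time scores).
    actual: recent values from production.
    bin_edges: ascending edges; values are assumed to fall within them.
    eps: floor for empty bins so the logarithm stays defined.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def fractions(values):
        counts = [0] * (len(bin_edges) - 1)
        for v in values:
            for i in range(len(counts)):
                is_last = i == len(counts) - 1
                # Bins are half-open, except the last one includes its upper edge.
                if bin_edges[i] <= v < bin_edges[i + 1] or (is_last and v == bin_edges[-1]):
                    counts[i] += 1
                    break
        return [max(c / len(values), eps) for c in counts]

    exp_frac = fractions(expected)
    act_frac = fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_frac, act_frac))
```

A PSI of zero means the two samples fill the bins identically. Larger values mean a bigger shift, and a commonly cited rule of thumb treats anything above roughly 0.2 as worth investigating, though the right alert threshold depends on the feature and the sample size.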
In the first six months, we retrained twice: once when a new marketing channel started generating leads with a different profile than the training data, and once when the approval criteria changed internally. Both times, the monitoring alerts caught the drift before the sales team noticed degraded predictions.
What I Would Do Differently
Start with a simpler model. XGBoost was the right final choice, but we could have shipped a logistic regression model in week one to start collecting feedback while we iterated on features. The gap between "no model" and "simple model" is far larger than the gap between "simple model" and "tuned model."
Build the feedback loop from day one. We added the ability for sales reps to flag incorrect scores after launch. We should have built this before launch. That feedback data is now one of our most valuable inputs for retraining.
Machine learning in production is an engineering problem first and a data science problem second. The model is one component in a system that includes data pipelines, APIs, monitoring, and human feedback loops.
This project taught me that the value of ML in a business context is not about sophisticated algorithms. It is about building a system that reliably turns data into better decisions, and that the people using it actually trust and act on.