As a data analyst working in a tech company, I was tasked with developing a scoring system to rank our business partners worldwide based on past performance and future potential. The goal was to support planning targets and budget allocation, ensuring that resources were directed toward the most valuable partners. At the time, we lacked a predictive model, so my approach was to create a structured scoring framework that allowed us to categorize partners efficiently and guide strategic decisions. While this method provided immediate clarity, I later realized it introduced biases that limited its long-term effectiveness.
What began as a straightforward categorization exercise became a deeper lesson in the limitations of rigid and oversimplified classification. This experience reinforced a broader truth in data analytics: categories are helpful starting points for structuring information and communicating insights, but they can also create a fixed worldview. They give us a sense that this is how things are, rather than how someone decided to organize the world. As John Maynard Keynes put it, “The difficulty lies, not in the new ideas, but in escaping from the old ones.” Categories shape how data is interpreted, often discriminating and distorting reality. True progress comes from breaking free of their constraints. While my initial model provided much-needed structure where none existed [1], over time it risked reinforcing outdated categorical assumptions and missing critical insights [2]. Transitioning toward a more fluid and complex approach [3] helped us to mitigate bias, improve decision-making, and position the company for smarter partner management.
1. Building the Partner Scoring System
My initial approach was straightforward. The scoring system was designed to evaluate and segment partners using a multi-factor weighted approach, incorporating key performance indicators (KPIs) that reflected both historical success and potential future value. The framework consisted of several components:
- Revenue Contribution Score: Measured the total revenue generated by the partner over the past 24 months. Higher revenue correlated with higher scores, but diminishing returns were applied to avoid over-favoring historically strong but stagnant partners.
- Deal Velocity and Consistency: Evaluated how quickly and frequently the partner closed deals. This was normalized to account for variations in deal size and industry verticals.
- Deal Size Elasticity: Captured the partner’s ability to adapt to different deal sizes. This helped surface partners with versatile selling capabilities.
- Engagement Score: Assessed interaction levels with the company, such as participation in joint marketing campaigns, training sessions, and partner enablement programs.
- Growth Trajectory: Included signals like year-over-year revenue growth, pipeline development, and increasing sales velocity. This was an early attempt to introduce forward-looking elements into the model.
- Strategic Alignment: A qualitative factor determined by sales and partner development teams, reflecting whether the partner was investing in new initiatives aligned with company goals.
Each component was assigned a weighting factor, allowing the model to compute a final partner score. These scores were then used to segment partners into three primary categories:
- High Performers [Score > 80]: Strong past performance and promising future potential.
- Stable Performers [Score between 50-80]: Consistently contributing but lacking aggressive growth signals.
- Underperformers [Score < 50]: Inconsistent or declining engagement.
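The weighted computation and threshold-based segmentation described above can be sketched in Python. The component names, weights, and sample values below are illustrative assumptions, not the production figures:

```python
# Hypothetical sketch of the weighted scoring model; weights are
# illustrative assumptions that sum to 1.0.
WEIGHTS = {
    "revenue_contribution": 0.30,
    "deal_velocity": 0.15,
    "deal_size_elasticity": 0.10,
    "engagement": 0.15,
    "growth_trajectory": 0.20,
    "strategic_alignment": 0.10,
}

def partner_score(components: dict) -> float:
    """Weighted sum of component scores, each on a 0-100 scale."""
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)

def segment(score: float) -> str:
    """Map a final score onto the three categories used by leadership."""
    if score > 80:
        return "High Performer"
    if score >= 50:
        return "Stable Performer"
    return "Underperformer"

# Example partner (values are made up for illustration).
components = {
    "revenue_contribution": 85,
    "deal_velocity": 70,
    "deal_size_elasticity": 60,
    "engagement": 75,
    "growth_trajectory": 65,
    "strategic_alignment": 80,
}
score = partner_score(components)
print(f"{score:.1f} -> {segment(score)}")
```

Note how every component feeds a single weighted sum, so a fraction of a point in any one component can push a partner across a category boundary, a property that matters in the next section.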
This categorization helped leadership quickly assess which partners to prioritize. It allowed for better allocation of resources and incentive programs, aligning sales efforts and future investments with the most promising partner relationships. Given the lack of a predictive framework at the time, this categorical approach provided structure where none existed.
2. The Pitfalls of a Categorical Approach
We thought categorically in this project for a good reason: to get a quick and easy understanding. At first, the model seemed effective, but over time I began noticing its flaws. Its alluring oversimplifications quietly built in bias and distortion. Some questions should have been part of our decision-making mantra: Were the categories valid? Were they useful?
Arbitrary cutoffs. The first hidden danger was the reliance on strict thresholds to dictate decisions. If a partner just missed a target, whether for revenue, deal velocity, or engagement, they were assumed to have a problem; if they just met it, everything was considered fine. Even though the two cases’ numbers were almost identical, one partner would have to make organizational changes while the other would carry on with business as usual. This arbitrary threshold clearly impeded learning from the data in such close cases; a tiny, fundamentally meaningless difference could lead to a dramatically different decision, and possibly the wrong one. This was particularly risky where investment decisions were binary. A partner would be excluded from a new sales initiative simply because they scored 49.8 instead of 50. A funding program required a certain engagement threshold, so partners who scored just under the cutoff were denied essential support despite only a marginal difference. The illusion of precision was misleading because an arbitrary boundary, rather than a true reflection of performance, decided a partner’s fate.
Amplification bias. Another glaring issue was amplification bias: minor variations in scores resulted in disproportionately large consequences because of the categorical cutoffs. A partner scoring 79.9 was categorized as “Stable,” while another scoring 80.1 was labeled a “High Performer.” Despite their near-identical performance, the difference in classification meant vastly different levels of support and investment, as noted earlier. In practice, the high performer would receive premium sales support, larger co-marketing budgets, and inclusion in exclusive strategy sessions, while the stable performer, though nearly identical in performance, would receive significantly less attention, limiting their potential to improve. I was afraid that over time this would widen the gap between partners rather than reflect their true capabilities. Those categorized as high performers would have more opportunities to grow, while others would struggle, not because they were inherently weaker, but because they lacked the same level of support. This self-fulfilling prophecy meant that once a partner was categorized lower, they had fewer chances to break out of that category, reinforcing the illusion that the scoring system was accurate when, in reality, it was driving the outcome.
Categorical rigidity. Another major issue was the rigidity of the categories, which failed to account for fluctuations in performance. A partner might face temporary setbacks due to market shifts, leadership changes, or external disruptions, yet the model treated their performance as static. For instance, when the pandemic disrupted business operations in 2020, some partners adapted quickly while others struggled. The scoring system failed to account for these externalities and penalized partners whose performance declined, treating the dip as an inherent weakness rather than a temporary setback. Indeed, some partners who struggled in 2020 rebounded in 2021, eventually outperforming those the model had labeled high performers.
Predictive blind spots. Because the model was mostly designed around historical performance, it failed to recognize emerging opportunities. This created two key blind spots. First, we overvalued established partners and undervalued new partners. Long-time partners who had once been strong received consistently high scores, even when their growth potential had plateaued. Conversely, newer partners, despite showing strong leading indicators, scored low simply because they lacked sufficient history. This bias toward the status quo meant that promising partners were systematically undervalued, delaying strategic investment in up-and-coming players. Over time, this misallocation of resources could hurt innovation. Some rising stars who could have been fast-tracked for high growth would be left struggling, while legacy partners coasting on past success would continue to dominate the rankings.
Overfitting and underfitting. Humans are wired to find patterns, even where none exist. For that reason, my categorical scoring system also suffered from both overfitting and underfitting, which contributed to biases by distorting the accuracy and reliability of the categorization. Overfitting occurred when we forced partners into categories that didn’t truly make sense, while underfitting happened when we ignored real performance predictors in favor of broad, surface-level classifications. A prime example was our segmentation of partners into high-growth or low-growth cohorts based on regional economic trends. We assumed that partners operating in fast-growing economies such as tech hubs in North America and Western Europe would naturally outperform those in slower-growing regions. This assumption overfitted our model to economic conditions rather than actual partner capabilities, leading us to prioritize investments in partners based on location targets and partner managers’ territories rather than sales execution. Meanwhile, we underfitted by failing to capture the real variables that drove success, such as executive buy-in, sales specialization, and marketing investment. By focusing solely on categories and not recognizing complexity and common ground, we would miss opportunities.
Perception Bias. By labeling partners, we inadvertently shaped perception and reinforced stereotypes that influenced relationships. In my first model, all the partners assigned to one of our partner managers in India were labeled as underperformers. This insight, if left unchallenged, would have significantly impacted her past-year bonus and upcoming quota, making it a highly sensitive issue. In reality, many of her partners were AI specialists with high engagement and strong client retention, but their deals took longer to close than those of other partners in the region due to the complexity of AI implementation cycles. However, because they were labeled as underperformers by the model, leadership hesitated to invest in them, assuming they were not worth the effort. This misclassification was a result of underfitting in the model’s scoring formula. The model placed excessive weight on short-term revenue and deal velocity, assuming that faster deal cycles indicated stronger performance. By failing to account for longer, high-value AI sales cycles, it oversimplified partner success and penalized partners whose strengths lay in high retention and strategic, complex deals. When the partner manager in India learned about the classification, she pushed back strongly, stating that her partners were not underperforming, that they just operated differently, and that the model was penalizing her for things that were actually strengths in her partners’ business models. The partner manager felt that she and her partners were being intentionally deprioritized, while our team initially assumed the partners were making excuses for weak sales. In reality, the model failed to account for different success factors across business models, treating longer deal cycles as a weakness rather than a characteristic of high-value AI sales. The label itself became an obstacle, creating an “us vs. them” dynamic that damaged collaboration.
By imposing rigid categories without room for context, we unintentionally fostered division and resentment, making it harder to build trust and productive partnerships.
3. Evolving the Model from Static to Probabilistic
While we had identified six strong reasons to transition our methodology, refining the model ultimately required a fundamental shift in mindset. I had to become more comfortable with uncertainty, nuance, and complexity. I also needed to be willing to revise results based on new evidence in order to make the model more adaptable. Rather than viewing partners as static entities, we began treating performance as a spectrum. To improve the system, I recognized the need to move from a categorical to a continuous scoring approach. Instead of rigid classifications, I introduced a weighted scoring model that would account for multiple variables dynamically. This transition involved five key steps.
Evaluating Categorical Validity. The analytics of segmentation had been handed to me when I joined the partner team. As a newcomer, I initially relied on predefined partner categories and historical business trends rather than validated predictive data patterns, which led to misclassification pitfalls over time. To make the segmentation more meaningful and data-driven, I applied clustering techniques to evaluate whether partners naturally grouped into the predefined categories or whether the model had been forcing arbitrary divisions. To assess the validity of these clusters, I applied silhouette scores, which measure how similar a data point is to its own cluster compared to other clusters, indicating how well-separated the segments are. A score close to +1 indicated that data points were well-matched to their assigned cluster and far from others, suggesting a clear and meaningful segmentation. A score near 0 indicated overlapping clusters, meaning the segmentation was weak and potentially arbitrary. Finally, a negative silhouette score suggested misclassified data points, where observations would be better placed in a different group. For instance, when testing the Middle East cohort in my first model iterations, investigating its low silhouette score revealed that these partners had high revenue variability and inconsistent engagement levels, meaning they did not share a uniform business trajectory. Similarly, the partner tail from the same region had a negative silhouette score, suggesting that some partners labeled as underperformers were actually more similar to stable performers than to truly failing partners. The lack of clear separation suggested that historical revenue alone was an insufficient classifier and that other parameters, such as engagement levels and sales velocity, should have been factored in.
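The silhouette check described above can be sketched with scikit-learn. The partner features here are synthetic stand-ins (the actual KPI vectors are not reproduced), so the cluster count and numbers are illustrative assumptions only:

```python
# Illustrative silhouette-score check on synthetic partner features
# (e.g. revenue and engagement, rows = partners).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(42)
# Three synthetic cohorts standing in for real partner KPI vectors.
features = np.vstack([
    rng.normal(loc=[80, 70], scale=5, size=(30, 2)),  # strong cohort
    rng.normal(loc=[55, 50], scale=5, size=(30, 2)),  # middle cohort
    rng.normal(loc=[30, 25], scale=5, size=(30, 2)),  # weak cohort
])

# Cluster the partners, then score the separation of the clusters.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
overall = silhouette_score(features, labels)       # mean over all partners
per_point = silhouette_samples(features, labels)   # one score per partner

print(f"mean silhouette: {overall:.2f}")
print(f"partners with negative silhouette: {(per_point < 0).sum()}")
```

A low mean silhouette, or individual partners with negative per-point scores, was exactly the signal that a predefined cohort did not hold together as a genuine cluster.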
Introducing more qualitative data. We integrated as much qualitative data as possible to capture dimensions that numerical models might miss, such as partner feedback, customer satisfaction, and strategic market positioning. Partner feedback revealed hidden obstacles. The AI-focused partners from India introduced earlier, misclassified as underperformers due to long deal cycles, turned out to have well-justified implementation timelines. Instead of reducing support, we adjusted the qualitative metrics to rebalance their final scores. Customer satisfaction was another critical factor absent from the original model. An Italian partner had lower revenue than its peers but consistently received exceptional Net Promoter Score (NPS) ratings and repeat business from large clients. Despite moderate short-term financial performance, their ability to retain and expand existing accounts made them a high-value partner in the long run, leading us to increase support for account-based marketing initiatives. Market influence was another dimension that numbers alone failed to capture. A partner in the Asia-Pacific region was considered underperforming but was in fact a recognized thought leader, frequently speaking at high-value and emerging industry events and influencing high-profile customers. By recognizing their brand authority and network strength, we repositioned them as a strategic alliance partner rather than evaluating them solely on sales metrics. By incorporating these non-traditional performance indicators, we caught early signals of long-term success, avoided misclassification mistakes, and refined our investment approach to better support partners with untapped strategic potential.
Embrace Probabilistic Scoring. Rather than using a strict categorical ranking, I introduced probability-based scoring that estimated a partner’s likelihood of future success based on multiple data points. I incorporated predictive factors such as early-stage pipeline growth, customer retention, and win rate. Then, using Python, I leveraged machine learning techniques such as logistic regression and random forest models to compute a probability score for each partner, representing their likelihood of exceeding a defined performance threshold. I chose logistic regression because it is well-suited for binary classification problems, in this case determining whether a partner would surpass a performance benchmark or not. This model estimates the probability that an outcome will occur based on a set of independent variables, making it ideal for a scenario where we needed a structured, interpretable method to rank partners. Before training the model, I standardized the data by selecting, scaling, and processing the predictive variables to remove inconsistencies, ensuring the model could make stable predictions. Logistic regression then assigned a probability score to each partner, helping us determine the likelihood of their future success in a way that was easy to interpret and justify. However, since real-world relationships are rarely linear, I also introduced random forest to capture non-linear relationships and interactions between variables. Unlike logistic regression, which assumed a linear relationship between inputs and outcomes, random forest built multiple decision trees and aggregated their predictions, reducing noise and detecting patterns that a linear model might miss. For instance, a high customer retention rate only drove revenue growth if paired with an expanding early pipeline. Additionally, random forest provided feature importance rankings, identifying the most influential success factors.
Logistic regression provided a structured probability baseline, while random forest uncovered hidden patterns, making predictions more robust. By combining both models, I created a comprehensive probability score. This helped us differentiate partners more precisely without artificial cutoffs.
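A minimal sketch of that two-model combination, assuming synthetic features and labels (the real predictors and training data are not reproduced here); blending the two probabilities with a simple average is also an assumption about how the scores were combined:

```python
# Sketch: combined logistic-regression + random-forest probability score.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Columns: early-stage pipeline growth, customer retention, win rate.
X = rng.uniform(0, 1, size=(200, 3))
# Synthetic target ("exceeded the benchmark"), with the retention x
# pipeline interaction mentioned in the text baked in.
y = ((0.6 * X[:, 0] * X[:, 1] + 0.4 * X[:, 2]) > 0.35).astype(int)

# Scale the inputs for the linear model; the forest handles raw features.
logit = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

def success_probability(x: np.ndarray) -> float:
    """Average both models' probabilities of exceeding the benchmark."""
    x = x.reshape(1, -1)
    p_logit = logit.predict_proba(x)[0, 1]
    p_forest = forest.predict_proba(x)[0, 1]
    return (p_logit + p_forest) / 2

print(f"score: {success_probability(np.array([0.9, 0.8, 0.7])):.2f}")
print("feature importances:", forest.feature_importances_)
```

In practice, the logistic regression kept the score explainable to stakeholders, while the forest’s feature importances surfaced which predictors actually drove success.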
Decision Flexibility. Rather than making rigid “go/no-go” decisions based on categories, we scaled investments proportionally based on partner probability scores. We implemented a tiered investment strategy, where decisions were made using probability confidence intervals rather than static thresholds. For instance, instead of classifying a partner as a “High Performer” at 80+ points, we allocated incentives based on confidence bands:
- >90% confidence of high performance → Full strategic investment.
- 70-89% confidence → Moderate support and growth enablement.
- 50-69% confidence → Limited investment with ongoing monitoring.
- <50% confidence → Minimum engagement unless improvement signals appear.
This tiered model allowed us to scale resources intelligently and proportionally instead of applying an all-or-nothing approach.
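The bands above can be sketched as a simple lookup that replaces the old single cutoff; the tier strings mirror the list, and the exact behavior at band edges is an assumption:

```python
# Sketch: map a success probability (0-1) onto the tiered bands above.
def investment_tier(p_success: float) -> str:
    """Return the investment tier for a given probability of success."""
    if p_success >= 0.90:
        return "full strategic investment"
    if p_success >= 0.70:
        return "moderate support and growth enablement"
    if p_success >= 0.50:
        return "limited investment with monitoring"
    return "minimum engagement"

for p in (0.93, 0.80, 0.55, 0.40):
    print(f"{p:.0%} -> {investment_tier(p)}")
```

A partner at 49.8% now lands in the adjacent band rather than being excluded outright, so a small movement in score shifts them one tier, not from full support to nothing.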
Monitor Continuously. Instead of performing quarterly or annual reviews, we transitioned to a near-real-time performance monitoring system that dynamically adjusted partner scores based on incoming data. We implemented automated dashboards that ingested data weekly from CRM systems and from partner managers’ qualitative and quantitative inputs; scores were then recalculated monthly using the latest deal velocity, pipeline movement, and engagement metrics. Instead of retroactively analyzing performance, we anticipated shifts in partner trajectories by continuously tracking trends, enabling proactive adjustments in investment rather than relying on static labels and reacting too late.
Explore more
To access another perspective on the topic, read the article by Bart de Langhe and Philip Fernbach published in Harvard Business Review (2019): “The Dangers of Categorical Thinking.”