Learn the Utility of Power Query

Cleaning Data Before the Load

Preparing Legacy System Data for Modern Analytics

In today’s data-driven world, migrating data from legacy systems to modern platforms is not just a technical upgrade—it’s a strategic imperative. Yet, one of the most overlooked but critical phases of data migration is data cleaning before the load.

At Beacon Data Analytics, we believe that “Garbage In = Garbage Out” isn’t just a cliché—it’s a reality. No matter how powerful your new system is, if the data fed into it is flawed, the outputs and decisions based on it will be unreliable.

In this article, we explore why data cleaning is crucial before loading into new systems, what challenges legacy systems present, real-world examples, and the tools you can use—including Excel, Python, Power Query, Jupyter Notebooks, and PySpark—to ensure clean, accurate, and business-ready data.

Why Clean Data Before Loading?

Data migration is not a simple copy-paste job. Legacy systems often store decades of data riddled with outdated formats, missing values, inconsistent entries, obsolete structures, and even hidden corruption. When migrating to new systems—be it cloud data warehouses, modern ERPs, CRMs, or advanced analytics platforms—this messy data can:

  • Break new data pipelines

  • Lead to incorrect analytics

  • Trigger compliance issues

  • Delay go-live timelines

  • Inflate costs through rework

Cleaning data before the load ensures that your new system begins with clean, trusted, and validated information.

Common Data Issues in Legacy Systems

Legacy systems—from old ERP solutions like SAP R/3 to in-house-built CRM or financial tools—present several challenges:

1. Duplicate Records

  • Multiple records for the same customer due to inconsistent name spelling or ID formats.

  • Example: John Smith, J. Smith, and Smith, John could all represent the same individual.

2. Missing or Incomplete Data

  • Empty fields for key business attributes like customer contact, product codes, or dates.

  • Example: Invoice records missing amounts or transaction dates.

3. Inconsistent Formats

  • Different date formats (e.g., MM/DD/YYYY vs YYYY-MM-DD) or phone number structures.

  • Currency inconsistencies due to lack of standardization.

4. Obsolete Values

  • Outdated codes, discontinued product IDs, or inactive clients still present in active records.

5. Free-text Fields & Unstructured Data

  • Comments or descriptions that mix numbers and text, making automated processing difficult.
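
As a small illustration of the free-text problem, here is a minimal Python sketch that pulls numeric values out of mixed comment fields. The sample comments are hypothetical, not taken from any real system:

```python
import re

# Hypothetical free-text comments mixing quantities and descriptions
comments = [
    "Shipped 12 units to Cape Town",
    "Refund: 250.50 ZAR approved",
    "No quantity recorded",
]

def extract_numbers(text):
    """Pull all numeric tokens (integers or decimals) out of a free-text field."""
    return [float(m) for m in re.findall(r"\d+(?:\.\d+)?", text)]

parsed = [extract_numbers(c) for c in comments]
print(parsed)  # [[12.0], [250.5], []]
```

Even a simple pattern like this surfaces which records carry machine-readable values and which need human review.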

Step-by-Step: Cleaning Legacy Data Before Load

At Beacon, our structured approach involves several key steps:

1. Data Profiling

  • Goal: Understand what the data looks like.

  • Tools: Power Query, Python (pandas), Excel’s Analysis ToolPak.

  • What we check: Nulls, value ranges, unique values, data types.
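
A minimal pandas sketch of this profiling step, using an invented legacy extract (the column names and values are illustrative only):

```python
import pandas as pd

# Hypothetical legacy customer extract with typical quality problems
df = pd.DataFrame({
    "customer_id": ["CUST-001", "001-C", None, "CUST-002"],
    "signup_date": ["2019-01-05", "05/01/2019", None, "2020-03-10"],
})

# Basic profile: null counts, distinct values, and inferred dtypes per column
profile = pd.DataFrame({
    "nulls": df.isna().sum(),
    "unique": df.nunique(),
    "dtype": df.dtypes.astype(str),
})
print(profile)
```

A one-screen profile like this is often the fastest way to decide which columns need standardization, imputation, or escalation to the business.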

2. Standardization

  • Goal: Make data uniform across fields.

  • Tools: Excel formulas, Python string functions, Power Query transformations.

  • Actions:

    • Convert all date fields to a common format.

    • Normalize case in text (e.g., all uppercase or proper case).

    • Standardize units of measure (e.g., kg vs lbs).
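
The three actions above can be sketched in pandas as follows; the data and column names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["  john SMITH ", "MARY jones"],
    "order_date": ["03/15/2021", "2021-04-02"],  # mixed date formats
    "weight_lbs": [10.0, 22.0],
})

# Normalize case and trim whitespace in text fields
df["name"] = df["name"].str.strip().str.title()

# Convert each mixed-format date to a single ISO representation
df["order_date"] = df["order_date"].apply(
    lambda s: pd.to_datetime(s).strftime("%Y-%m-%d")
)

# Standardize units of measure: pounds to kilograms
df["weight_kg"] = (df["weight_lbs"] * 0.453592).round(2)
print(df)
```

The same transformations map one-to-one onto Power Query steps (Trim, Capitalize Each Word, Change Type, custom column), which makes it easy to hand the logic to a self-service audience.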

3. De-duplication

  • Goal: Ensure each entity (e.g., customer, product) is unique.

  • Tools: Power Query’s “Remove Duplicates,” pandas’ drop_duplicates(), PySpark’s dropDuplicates().
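
A hedged pandas sketch of de-duplication that keeps the most recently updated row per key (the sample data is hypothetical):

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [101, 101, 102],
    "name": ["John Smith", "John Smith", "Mary Jones"],
    "updated": ["2021-01-01", "2022-06-30", "2020-05-10"],
})

# Sort so the latest record per customer_id comes last, then keep that one
deduped = (
    customers.sort_values("updated")
    .drop_duplicates(subset="customer_id", keep="last")
    .sort_index()
)
print(len(deduped))  # 2
```

Choosing *which* duplicate survives (latest update, most complete record, system of record) is a business decision, not just a technical one, and is worth agreeing with stakeholders up front.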

4. Validation

  • Goal: Check data against business rules.

  • Examples:

    • Invoice amounts should be > 0

    • Dates should not be in the future

  • Tools: Excel conditional formatting, Python validation scripts, SQL queries.
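
The two example rules above can be expressed directly as pandas filters. This is a simplified sketch with invented invoice data and a fixed reference date for reproducibility:

```python
import pandas as pd

invoices = pd.DataFrame({
    "invoice_id": [1, 2, 3],
    "amount": [150.0, -20.0, 300.0],
    "invoice_date": pd.to_datetime(["2021-05-01", "2021-06-01", "2099-01-01"]),
})

today = pd.Timestamp("2024-01-01")  # fixed "today" so the example is reproducible

# Business rules: amounts must be positive, dates must not be in the future
violations = invoices[(invoices["amount"] <= 0) | (invoices["invoice_date"] > today)]
print(violations["invoice_id"].tolist())  # [2, 3]
```

Collecting violations into their own frame, rather than dropping them silently, lets you report exactly which records failed which rule.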

5. Mapping & Transformation

  • Goal: Align legacy fields with the target system schema.

  • Tools: Python dictionaries, Power Query joins, Excel VLOOKUP/XLOOKUP, PySpark mappings.

6. Enrichment (Optional)

  • Goal: Add value to data, such as appending geolocation, currency conversions, or industry tags.

7. Audit Trail & Logging

  • Keep logs of cleaning steps for traceability.

  • Tools: Python’s logging module, or Jupyter Notebooks as living documentation.
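
A minimal sketch of this audit-trail idea using Python’s standard logging module; the cleaning function and log destination are illustrative:

```python
import logging

# Route the audit trail to a file so every cleaning step is traceable
logging.basicConfig(
    filename="cleaning_audit.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("cleaning")

def remove_duplicates(rows):
    """Deduplicate a list while recording how many rows were dropped."""
    unique = list(dict.fromkeys(rows))  # preserves first-seen order
    log.info("Removed %d duplicate rows (%d -> %d)",
             len(rows) - len(unique), len(rows), len(unique))
    return unique

cleaned = remove_duplicates(["A", "B", "A", "C"])
print(cleaned)  # ['A', 'B', 'C']
```

Wrapping each cleaning step in a function that logs its own before/after counts gives auditors a line-by-line record of what changed and when.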

Tools of the Trade: What We Use

Depending on the size, complexity, and format of your legacy data, we choose the right tool—or combination of tools—for the job:

🧮 Microsoft Excel

  • Best For: Smaller datasets, initial reviews, business rule definition.

  • Use Case: Quick wins like trimming spaces, identifying nulls, and human-readable reporting.

🧰 Power Query (Excel & Power BI)

  • Best For: Visual transformation pipelines, repeatable steps.

  • Use Case: Merge/join multiple sources, filter rows, pivot/unpivot legacy data.

🐍 Python (pandas & openpyxl)

  • Best For: Larger datasets, automation, complex logic.

  • Use Case: Write reusable scripts to clean millions of rows efficiently.

📓 Jupyter Notebooks

  • Best For: Documentation + Code + Results in one place.

  • Use Case: Collaborative cleaning efforts across teams with both technical and non-technical stakeholders.

🔥 PySpark

  • Best For: Big data processing at a scale beyond pandas (hundreds of millions of rows and up).

  • Use Case: Cluster-based transformations before loading into data lakes, lakehouses, or modern cloud platforms like Azure, AWS, and Google Cloud.

Real-World Example: Legacy Finance System Migration

Client Profile:
A mid-sized financial services firm operating across Southern Africa, managing customer data, invoices, and product histories through a custom-built legacy system that had been in use for over 15 years.


📉 Challenges

The client faced multiple technical and business hurdles during the data migration project to Microsoft Dynamics 365:

  • Three inconsistent customer ID formats across regions (e.g., CUST-001, 001-C, and numeric-only versions), making record matching difficult.

  • 20% of invoice records were missing critical fields, especially invoice dates and totals, due to years of inconsistent manual entries and system bugs.

  • Discontinued products were still appearing in open transactions, leading to potential compliance risks and inaccurate inventory reporting.

  • Data duplication across business units with slight naming variations caused bloated reports and unreliable segmentation.

  • Stakeholders had limited visibility into data quality due to siloed teams and lack of historical documentation.

  • Time pressure: The migration had to be completed within 8 weeks to align with a Dynamics 365 deployment deadline.


🛠️ Our Solution

Beacon designed and executed a robust data preparation and migration strategy:

🔹 Python Automation (pandas)

  • Developed reusable Python scripts using pandas to:

    • Normalize and map customer ID formats into a unified schema

    • Flag and replace missing invoice dates with calculated or inferred values (based on patterns in transaction history)

    • Detect outlier invoice amounts and filter invalid records using statistical rules

    • Validate foreign keys against master product and customer reference lists
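
The ID normalization step above can be sketched as follows. This is a simplified illustration, not the client’s actual script, and the sample IDs merely mimic the three formats described:

```python
import re
import pandas as pd

# Hypothetical samples of the three legacy ID formats (prefixed, suffixed, numeric-only)
raw_ids = pd.Series(["CUST-001", "001-C", "1"])

def normalize_customer_id(value):
    """Map any of the three legacy formats onto a unified CUST-NNN scheme."""
    digits = re.search(r"\d+", str(value)).group()
    return f"CUST-{int(digits):03d}"

unified = raw_ids.map(normalize_customer_id)
print(unified.tolist())  # ['CUST-001', 'CUST-001', 'CUST-001']
```

Here all three variants resolve to the same unified key, which is exactly what makes cross-region record matching possible.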

🔹 Power Query Workflows

  • Used Power Query to build interactive workflows for:

    • Fuzzy matching of duplicate customer names, even when spelled inconsistently

    • Merging fragmented data sources from different business units

    • Generating de-duplication reports and confidence scores for stakeholder approval

🔹 SME Collaboration in Excel

  • Built Excel-based interfaces for Subject Matter Experts (SMEs) to:

    • Manually verify uncertain matches and mappings

    • Provide domain knowledge (e.g., flagging which discontinued products should be reactivated vs ignored)

    • Approve final versions of cleansed datasets before loading

🔹 Logging & Audit Trail

  • Implemented change logs and validation reports to maintain a full audit trail of all cleaning and transformation steps

  • Delivered clear documentation and Python notebooks for future reference and governance


✅ Result

  • 99.7% clean dataset delivered to the Dynamics 365 team—validated and signed off by business and technical stakeholders.

  • Completed 3 weeks ahead of schedule, avoiding costly deployment delays.

  • Reduced ongoing data management issues by implementing reusable cleaning scripts and business logic templates.

  • Enabled the client to launch Dynamics 365 with confidence, supported by structured and trusted data.

  • Post-launch feedback from the CFO emphasized “a night-and-day difference in reporting accuracy and executive dashboard usability.”


💡 Key Takeaways

  • Legacy data is not a liability—it’s an opportunity to rebuild trust in analytics when approached strategically.

  • Combining automation (Python), self-service tools (Power Query), and business insight (Excel) yields scalable, repeatable success.

  • A well-planned data preparation strategy can compress timelines, improve migration outcomes, and reduce business risk.