
The Messy Truth about Working with Real Data
Many new data professionals begin their journey with a structured and well-organised view of what data work entails. They might expect that datasets are always clean, questions are always clear and projects follow a logical, linear path from start to finish. This perception, often shaped by academic exercises and online tutorials, rarely holds up in real-world settings.
In practice, working with real data presents a very different set of challenges, and understanding them is crucial for anyone aspiring to build a career in data science or analytics.
Real Data is Rarely Clean
Most educational datasets are curated for teaching. They are complete, correctly formatted and free from irregularities. However, real data in industry is frequently messy and inconsistent. It is common to encounter missing values, often the result of human error, system limitations, or changes in data collection processes. Dates, currency and categorical variables may appear in a wide variety of formats across the same dataset, creating confusion and inconsistencies. Duplicate or corrupted entries can creep in when data is pulled from merged sources or automated systems. In addition, some variables may be ambiguous or undocumented, making it unclear what they represent or how they were calculated.
Addressing these issues is not just a preliminary step; it is a core part of the analytical process. Tools such as OpenRefine provide specialised functions for identifying and correcting these inconsistencies. With its intuitive interface, OpenRefine allows users to cluster similar but non-identical entries, apply transformations across entire columns and explore data variations that might otherwise be missed. For those working programmatically, Pandas in Python and dplyr in R offer comprehensive libraries for filtering, transforming and reshaping data.
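The kind of clean-up described above can be sketched in a few lines of Pandas. The dataset below is invented for illustration: it mixes date formats, carries inconsistent category labels and contains a duplicate row, which are exactly the irregularities that disappear once labels are normalised, dates are parsed leniently and duplicates are dropped (the `format="mixed"` option assumes pandas 2.0 or later):

```python
import pandas as pd

# Hypothetical messy extract: mixed date formats, inconsistent category
# labels, an exact-duplicate row and a missing value.
raw = pd.DataFrame({
    "order_date": ["2024-01-05", "Jan 5, 2024", "2024-01-05", None],
    "region": ["North", "north ", "North", "SOUTH"],
    "amount": [100.0, 100.0, 100.0, 250.0],
})

# Normalise category labels before deduplicating, so "north " and
# "North" are recognised as the same value.
raw["region"] = raw["region"].str.strip().str.title()

# Parse dates leniently: each element is inferred individually, and
# unparseable values become NaT instead of raising an error.
raw["order_date"] = pd.to_datetime(raw["order_date"], format="mixed",
                                   errors="coerce")

# Drop exact duplicates, then count what still needs investigating.
clean = raw.drop_duplicates().reset_index(drop=True)
print(len(clean))                        # rows remaining after deduplication
print(int(clean["order_date"].isna().sum()))  # missing dates to follow up
```

Note that the duplicate only becomes visible after the labels and dates are standardised; run `drop_duplicates` first and it is silently missed, which is why ordering the cleaning steps deliberately matters.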
Crucially, understanding data provenance, which involves knowing where data comes from, how it was generated and what its limitations are, is a foundational competency. Without this understanding, even the most sophisticated analysis can lead to misleading or invalid conclusions.

Defining the Right Questions is Part of the Job
Real-world data projects rarely begin with clearly defined objectives. Stakeholders may express broad goals like wanting to "understand customer behaviour" or request a "dashboard that shows everything." These vague directives lack the specificity needed for effective analysis.
One of the most valuable skills a data professional can develop is the ability to transform these broad ideas into precise, actionable analytical questions. This process begins with consulting with colleagues to clarify the business objectives, identifying what decisions the analysis will support. It continues with understanding the intended audience, considering who will use the results and in what context. Finally, the scope must be refined based on what data is available and what can be realistically measured.
This translation from business language to analytical framing requires not only technical knowledge but also strong communication, listening skills and critical thinking. Without this foundation, even a technically correct analysis may fail to deliver value. Although educational programmes are starting to address this skill set, it often remains underemphasised.
Data Projects are Iterative, Not Linear
It is a common misconception that data projects proceed in a tidy, linear sequence: gather the data, clean it, analyse it and present the findings. In reality, the process is rarely this straightforward. As work progresses, new questions emerge, assumptions are challenged and data issues often surface late in the timeline, prompting revisions or restarts.
This iterative nature of data work means that flexibility and adaptability are as important as precision. Tools that support reproducibility and transparency can be especially helpful. For instance, Jupyter Notebooks and RMarkdown enable analysts to document their steps and present analysis in a way that is easy to follow and replicate. Version control systems, such as Git, allow teams to track changes, collaborate on code and maintain a historical record of project development. Tools like dbt (Data Build Tool) add another layer of organisation by allowing structured, modular transformations with built-in documentation and testing.
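The structured, tested transformations that dbt encourages can be approximated in plain Python as well: each step is a small named function, and lightweight assertions document the assumptions the pipeline makes, failing fast when iteration invalidates them. A minimal sketch, with invented column names:

```python
import pandas as pd

def add_revenue(df: pd.DataFrame) -> pd.DataFrame:
    """One modular transformation step: derive revenue from units sold."""
    out = df.copy()
    out["revenue"] = out["units"] * out["unit_price"]
    return out

def check_no_negative_revenue(df: pd.DataFrame) -> pd.DataFrame:
    """A dbt-style data test: make the pipeline's assumption explicit."""
    assert (df["revenue"] >= 0).all(), "revenue should never be negative"
    return df

orders = pd.DataFrame({"units": [2, 5], "unit_price": [10.0, 3.0]})

# Chaining named steps with .pipe keeps each stage readable, reusable
# and individually testable as the analysis evolves.
result = orders.pipe(add_revenue).pipe(check_no_negative_revenue)
print(result["revenue"].tolist())  # [20.0, 15.0]
```

When a stakeholder question changes mid-project, only the affected step needs rewriting, and the embedded checks flag any downstream assumptions that no longer hold.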
Adopting elements of Agile project management can also help manage the non-linear nature of data work. Iterative development cycles, stakeholder feedback loops and prioritised backlogs all enhance responsiveness and reduce the risk of misaligned outcomes.
Educational Implications
For those preparing to enter the data profession, internalising these insights can lead to more effective and meaningful learning. It helps set realistic expectations and encourages a more holistic approach to skill development.
Aspiring analysts and scientists are encouraged to work with real-world datasets, such as those available from Kaggle, data.gov.uk, or academic repositories. These datasets offer a messier but more realistic experience than most training materials. Time should also be spent on data cleaning and documentation, using tools like OpenRefine and Pandas to get hands-on practice with real problems.
Beyond technical practice, learners should also become comfortable with exploratory data analysis. This includes learning how to derive meaningful questions from raw data, rather than relying on pre-defined problems. Engaging in project-based work, whether through internships, personal projects, or volunteer initiatives, is another effective way to build practical experience. Our IoA Learning hub has lots of projects and training courses too, available as part of IoA membership. Finally, explaining one's process and findings to peers or mentors through blogs, presentations, or collaborative reviews reinforces understanding and builds essential communication skills.
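Exploratory data analysis of the kind described above needs no heavy tooling to get started: a few Pandas one-liners reveal a dataset's shape, missingness and distributional quirks, and those quirks are what suggest the questions worth asking. A quick sketch on an invented sample standing in for a freshly downloaded dataset:

```python
import pandas as pd

# Invented sample data; in practice this would come from read_csv.
df = pd.DataFrame({
    "age": [34, 45, None, 29, 61],
    "segment": ["retail", "retail", "wholesale", "retail", None],
    "spend": [120.5, 340.0, 90.0, 0.0, 510.0],
})

print(df.shape)                      # how much data is there?
print(df.isna().sum())               # where are the gaps?
print(df["segment"].value_counts())  # is one category dominant?
print(df["spend"].describe())        # any suspicious zeros or outliers?
```

Each printout prompts an analytical question: why is a segment missing, is the zero spend a refund or a data error, and is the sample large enough to say anything at all? Practising this habit is what turns raw data into well-posed problems.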
Conclusion
Real-world data science is complex, iterative and often messy. But that complexity is also what makes the field intellectually rewarding and impactful. As the profession evolves, those who embrace the ambiguity, refine their problem-solving approach and continuously learn will thrive.
For newcomers, it's not about having all the answers. It's about developing the mindset and methods to work through the questions that matter, even when the path forward isn't always clear.
Looking for ways to improve your communication and data cleaning skills? All IoA members have access to our comprehensive hands-on training courses and structured digital professional portfolio platform to build and showcase evidence of your skills throughout your career. Find out about the right membership package for you here.