Mastering the Art of Building Analytical Datasets: The Key to Unlocking AI/ML Excellence
The creation of the analytical dataset is a crucial stage in the AI/ML lifecycle, embodying the principle “garbage in, garbage out.” An analytical dataset comprises the inputs for your AI/ML model, including the target variable (Y) for supervised learning. This dataset is pivotal as it directly influences the model’s quality and effectiveness. The upcoming section will delve into essential considerations for building, cleaning, and transforming data to construct your first analytical dataset. This foundational work is vital for hypothesis testing and feature selection, setting the stage for developing your initial model version.
Data requirements should be tied directly to defined user stories or MECE hypothesis trees to clarify what data is needed and why.
Say a hospital network is looking to leverage AI/ML to enhance patient care and optimize hospital operations. The goal is to reduce patient readmission rates and improve the efficiency of resource allocation across its facilities.
One way to lay out requirements is to create a data requirements document that simply lists the data needed, such as patient medical history, diagnosis information, and previous admissions. A better way is to first develop user stories and hypotheses and tie each data requirement back to them, which makes the request much easier to understand. Let's explore that approach in further detail below:
User Story: As a hospital administrator, I want to predict patient readmission risks so that I can implement targeted interventions and reduce readmission rates.
Hypothesis Tree:
- Hypothesis 1: Patients with certain chronic conditions are more likely to be readmitted.
- Data needed: Patient medical history, diagnosis information, previous admissions.
- Hypothesis 2: Readmission rates are higher for patients discharged on weekends or holidays.
- Data needed: Discharge dates and times, readmission records.
- Hypothesis 3: Post-discharge follow-up improves patient outcomes and reduces readmissions.
- Data needed: Follow-up appointment records, patient outcome measures.
As you can see, this approach provides far more clarity on the why and the what behind the ask.
Ensure IT and stakeholders understand the ‘why’ behind data requests to avoid missing essential data and its interconnections.
The specific value here is that by giving IT and data stewards the bigger picture, they can proactively guide you on how to connect all the information, which becomes valuable for building your overall dataset.
To build an AI/ML model, you must first identify which features have the most predictive power. To that end, you need to go through multiple rounds of hypothesis testing, feature engineering, and feature selection. It's best to create an analytical dataset for all of this upfront rather than run tests across disaggregated datasets.
Say you have two hypotheses:
- Patients with certain chronic conditions are more likely to be readmitted.
- Readmission rates are higher for patients discharged on weekends or holidays.
Chances are the data to test the two hypotheses comes from different datasets. You could aggregate them into a single analytical dataset or test them using separate datasets. There are two reasons why you should aggregate:
- Consistency in Data Handling: Using one dataset ensures that the data processing, cleaning, and transformation steps are consistent across all variables. This uniformity reduces the risk of discrepancies or biases from handling datasets separately.
- Cross-Hypothesis Insights: Investigating both hypotheses together may uncover insights that would be missed if the datasets were analyzed separately. For instance, the impact of discharge timing might be different for patients with chronic conditions compared to those without, which could lead to a more nuanced understanding and targeted interventions.
To generate business insights that can be adopted, you need to understand the level of granularity at which your model and its underlying data need to operate.
Let's use the patient's re-admission use case. You can create a dataset at a patient level or department level. Both are valid options, but it's largely through a dataset at the patient level that the model can accurately identify factors contributing to readmissions for individual patients. This allows the hospital to implement targeted interventions, like personalized follow-up care plans, to prevent readmissions, thus improving patient outcomes and reducing costs.
If the data were only at a more aggregated level, such as the department or ward level, it would be difficult to identify specific risk factors for individual patients, leading to less effective readmission prevention strategies.
Conversely, if the data were too granular, focusing on minute-by-minute monitoring data for all patients could overwhelm the model with irrelevant information, making it less efficient and harder to extract meaningful insights.
As you understand data sources in further detail, make sure to update the architecture and data flow diagrams that you built as part of your scoping exercise.
People are generally visual learners, and when you update the architecture and data flow diagrams, it becomes easier for people to track requirements (what data to gather), ideate on tasks (what pipelines need to be developed, how data is to be joined), and develop acceptance criteria (what the end state looks like). Tracking all of this solely through a Kanban board or Jira ticket is a cumbersome process that leads to challenges if it's not complemented by an effective visual.
When your raw data is pulled from tables, it passes through various storage units. Understanding what these are is essential to building a robust solution.
Data Lake: This is a central repository that allows you to store all your structured and unstructured data at any scale. Data pulled from various sources is often stored here first. The data lake acts as a staging area where data can be kept in its raw form until it's needed for analysis. This setup allows for greater flexibility in handling and processing data.
Integration with Cloud Platforms:
- In Google Cloud Platform (GCP), data can be moved from a data lake (like Google Cloud Storage) to analytics services like BigQuery for analysis, or to AI Platform for machine learning tasks.
- In Amazon SageMaker, you might store your data in Amazon S3 (Simple Storage Service) as your data lake, then import it into SageMaker for machine learning model training and analysis.
- For SAS, data can be imported from various sources, including data lakes, into the SAS environment for advanced analytics, data management, and business intelligence.
When you’re ready to analyze your data, it’s often brought into a temporary storage or processing environment that is part of the analytics or machine learning platform you’re using. This could be:
- An instance in Google Cloud’s BigQuery, a powerful data warehouse tool for analysis.
- A notebook instance or data processing job in Amazon SageMaker.
- A workspace in SAS where data is loaded for analysis and model building.
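As a minimal sketch of this flow, assuming an AWS-style setup where Amazon S3 serves as the data lake (the bucket, key, and file names below are hypothetical), you could pull a raw file into a pandas DataFrame inside a SageMaker notebook like this:

```python
import boto3
import pandas as pd

# Hypothetical bucket and key names -- substitute your own data lake locations.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="hospital-data-lake", Key="raw/admissions.csv")

# Read the raw file straight from the data lake into a DataFrame for analysis.
admissions = pd.read_csv(obj["Body"])
print(admissions.head())
```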
Understand the importance of primary keys in dataset construction and how to identify them.
In a dataset, a primary key is a specific column or set of columns that uniquely identifies each table row. Take the re-admissions use case as an example.
Each patient’s record in the dataset must be unique to avoid duplicating or mixing up information. A primary key (like a patient ID) ensures that each record is distinct so the AI/ML model can accurately track and analyze each patient’s admission and readmission events.
The primary key allows for the linkage of different data sources or tables. For example, patient ID can link a patient’s demographic information, medical history, and readmission records. This linkage is crucial for a comprehensive analysis of factors influencing readmissions.
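As a quick illustration (the file and column names are hypothetical), you can verify a candidate primary key with pandas and then use it to link tables:

```python
import pandas as pd

# Hypothetical extracts keyed on patient_id.
patients = pd.read_csv("patients.csv")
readmissions = pd.read_csv("readmissions.csv")

# Verify the candidate primary key really is unique before relying on it.
assert patients["patient_id"].is_unique, "patient_id has duplicates"

# Use the key to link demographic data to readmission records.
analytical = readmissions.merge(patients, on="patient_id", how="left")
```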
Recognize whether your data is structured, semi-structured, or unstructured, and understand the implications for processing and analysis.
Structured Data: This is like data in a neat Excel spreadsheet, where everything has its place in rows and columns. Examples include names, dates, or addresses. Because of its organized format, structured data is easy to enter, store, search, and analyze. For example, in a customer database, each customer’s information (like name, age, and address) would be in separate columns.
Since structured data is so organized, it’s straightforward to use in databases, making it easier to input, query, and analyze. This makes tasks like data pipelining (moving data from one system to another), storage, and model development more efficient and less costly.
Semi-structured Data: This type of data isn’t as neat as structured data but still has some organization. Think of emails where the format isn’t as rigid: they have standard fields (like “To,” “From,” and “Subject”), but the body can be free-form text. Other examples include JSON and XML files.
Semi-structured data requires more work than structured data because it’s not in a strict table format. However, it’s still easier to work with than unstructured data because it has some level of organization. Tools that can handle variability in data formats are needed for processing and analysis.
Unstructured Data: This is like a big, messy closet where everything is thrown in. Examples include video files, images, and free-form text documents. This data doesn't fit neatly into tables, making it harder to organize and analyze.
Unstructured data is the most challenging to work with because it requires more advanced techniques and technologies for processing and analysis. It takes up more storage space, and extracting useful information often involves complex processes like natural language processing (NLP) for text, image recognition for photos, and data mining techniques.
Handling strategy for semi-structured data:
- Parsing: Use software tools to read and extract the useful parts of the data, like pulling contact info from an XML customer data file.
- Transformation: Convert the data into a more structured format, like a table, so it’s easier to analyze. For example, a JSON file of tweets can be turned into a spreadsheet where each row is a tweet and columns represent different tweet attributes like the date, text, and user.
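Here is a minimal sketch of that transformation, assuming a local JSON file of tweets (the field names are illustrative and will vary by source):

```python
import json
import pandas as pd

# Hypothetical JSON export of tweets: a list of objects with nested fields.
with open("tweets.json") as f:
    tweets = json.load(f)

# Flatten the nested JSON into a table: one row per tweet, one column per attribute.
df = pd.json_normalize(tweets)

# Keep only the attributes of interest (names assumed for illustration).
df = df[["created_at", "text", "user.screen_name"]]
df.to_csv("tweets_flat.csv", index=False)
```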
Handling strategy for unstructured data:
Text Data (Natural Language Processing or NLP):
- Tokenization: Break down text into smaller pieces, like sentences or words, to understand and analyze it better.
- Sentiment Analysis: Determine the emotion or opinion expressed in the text, like determining whether a product review is positive or negative.
- Keyword Extraction: Identify the most important words or phrases in a text to get a quick sense of what it’s about.
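As a toy sketch of tokenization and keyword extraction using only the Python standard library (real projects typically reach for libraries like NLTK or spaCy, and sentiment analysis needs a lexicon or trained model):

```python
import re
from collections import Counter

review = "The discharge process was slow, but the nursing staff was excellent."

# Tokenization: split free-form text into lowercase word tokens.
tokens = re.findall(r"[a-z']+", review.lower())

# Keyword extraction (very rough): count tokens after dropping common stop words.
stop_words = {"the", "was", "but", "a", "an", "and", "of", "to"}
keywords = Counter(t for t in tokens if t not in stop_words)
print(keywords.most_common(3))
```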
Image Data (Computer Vision):
- Image Recognition: Use AI to identify and classify objects in photos, like distinguishing between images of cats and dogs.
- Feature Extraction: Convert images into a form that a computer can process (like numbers or vectors) to analyze patterns or features in the visual data.
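A tiny sketch of the feature-extraction idea, assuming a hypothetical image file (real projects would use a pretrained model from a library such as torchvision to produce richer features):

```python
import numpy as np
from PIL import Image

# Load a hypothetical image and convert it to a numeric array (height x width x channels).
pixels = np.asarray(Image.open("xray.jpg").convert("RGB"))

# A crude feature vector: flatten the pixels and scale them to the 0-1 range.
features = pixels.flatten() / 255.0
print(features.shape)
```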
Understand the difference between time series, cross-sectional, and panel data and the implications it has for your analysis.
Cross-Sectional Data: Think of taking a snapshot of a population at a single time, like a survey of voter preferences during an election year. This data type captures various characteristics of subjects at one moment without considering time.
With cross-sectional data, the focus is on comparing different entities at the same point in time. The analysis might be simpler than with panel or time series data, but it doesn’t capture changes over time, limiting its use for trend analysis.
Time Series Data: This is like a timeline of events or measurements, such as daily stock market prices. Time series data is collected at regular intervals, focusing on the temporal sequence of data points.
Analyzing time series data involves looking for trends, patterns, and cyclical behaviors over time. This requires specific techniques like forecasting and trend analysis, which can be crucial for predicting future events based on past patterns.
Panel Data: Imagine tracking the same students’ grades across several years. Panel data combines features of time series data (data points collected over time) and cross-sectional data (data collected at a single point in time but across multiple subjects or entities). It’s like having a detailed record of multiple entities over time.
Panel data allows for more complex analysis, such as examining changes over time within and across groups. It requires more sophisticated analytical approaches to extract meaningful results.
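The structural differences are easy to see with toy pandas examples (the values below are made up purely for illustration):

```python
import pandas as pd

# Cross-sectional: many patients observed at a single point in time.
cross_section = pd.DataFrame(
    {"patient_id": [1, 2, 3], "age": [64, 52, 71], "readmitted": [1, 0, 1]}
)

# Time series: one measure tracked at regular intervals (the index is time).
daily_admissions = pd.Series(
    [42, 38, 51], index=pd.date_range("2024-01-01", periods=3, freq="D")
)

# Panel: multiple patients tracked over multiple periods (entity + time index).
panel = pd.DataFrame(
    {"patient_id": [1, 1, 2, 2], "year": [2023, 2024, 2023, 2024],
     "admissions": [2, 1, 0, 3]}
).set_index(["patient_id", "year"])
```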
Based on the requirements, understand whether you need to work with real-time data or batch data and what implications that has for your data engineering and model build.
Real-time Data: This is like having a continuous flow of information, where data is processed immediately as it comes in, similar to a live traffic update system. Real-time processing allows for instant analysis and decision-making.
- Impact on Data Engineering: Requires a robust infrastructure that can continuously stream data without lag or downtime. Technologies like Apache Kafka or Spark Streaming are often used.
- Impact on Model Build: Models must be lightweight and efficient to make quick predictions. They might be updated frequently as new data arrives, necessitating automated, ongoing training processes.
Batch Data: This involves collecting data over a period and then processing it all at once, like calculating the average daily temperature from hourly readings at the end of each day. Batch processing is suitable when immediate responses are not critical.
- Impact on Data Engineering: The infrastructure can be simpler and less costly than real-time systems. Data is processed in large chunks at scheduled times using tools like Apache Hadoop or batch processing features in cloud services.
- Impact on Model Build: Models can be more complex and computationally intensive since they don’t need to produce instant outputs. Training can occur on a less frequent schedule, using larger, consolidated datasets.
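To make the batch example concrete, here is a minimal pandas sketch of rolling hourly readings up to a daily average; in practice this would run as a scheduled nightly job rather than an ad hoc script:

```python
import pandas as pd

# Hourly temperature readings collected during the day (toy data).
hourly = pd.Series(
    [61.0, 63.5, 66.0, 64.5, 59.0],
    index=pd.to_datetime([
        "2024-07-01 06:00", "2024-07-01 12:00", "2024-07-01 18:00",
        "2024-07-02 09:00", "2024-07-02 21:00",
    ]),
)

# Batch step: aggregate the accumulated readings into one daily average.
daily_avg = hourly.resample("D").mean()
print(daily_avg)
```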
Create a data dictionary if one doesn't exist to aid in data understanding and future use. Ensure an SME validates it before moving forward.
I once spent 2–3 weeks on an analysis because my turnover numbers weren't syncing up with what was in a master report. It turned out the column I used to identify when an employee turned over followed a different definition from the one the report drew from.
Access to data dictionaries would have helped resolve this issue, but none were available. In any case, we created one for both datasets so the process could move forward and other teams could get past similar challenges faster.
Perform standard data quality checks for outliers, nulls, and inconsistent data types. Collaborate with stakeholders for validation.
Outliers: These are data points that are significantly different from the majority of the data. They can indicate variability in your data, experimental errors, or even true anomalies.
Basic Checks: Use methods like z-scores, IQR (Interquartile Range), or box plots to identify data points far from the norm. Plotting data using scatter plots or histograms can help visually spot outliers.
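Here is a small sketch of both checks with pandas (the length-of-stay values are made up; the cutoffs of 2 standard deviations and 1.5 × IQR are common conventions, not hard rules):

```python
import pandas as pd

los = pd.Series([2, 3, 1, 4, 2, 3, 45])  # hypothetical lengths of stay in days

# Z-score check: flag points far from the mean (2 or 3 standard deviations is typical).
z_scores = (los - los.mean()) / los.std()
print(los[z_scores.abs() > 2])

# IQR check: flag points beyond 1.5 * IQR outside the quartiles.
q1, q3 = los.quantile(0.25), los.quantile(0.75)
iqr = q3 - q1
print(los[(los < q1 - 1.5 * iqr) | (los > q3 + 1.5 * iqr)])
```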
Nulls are missing or undefined values in your dataset. If not handled properly, null values can skew your analysis and lead to incorrect conclusions.
Basic Checks: Determine the number of null values in each dataset column. Assess the impact of nulls on your analysis and decide whether to impute (fill in), remove, or ignore them based on the context.
Inconsistent Data Types refer to data that doesn’t match the expected type, like alphabetical characters in a numeric field. Inconsistent data types can cause errors in data processing and analysis.
Basic Checks: Type verification (ensure each column in your dataset matches the expected data type, e.g., numbers, strings, dates) and format consistency (check that the data within each column follows a consistent format, especially for dates and categorical data).
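A minimal sketch of these checks, assuming a hypothetical admissions file with a discharge_date column:

```python
import pandas as pd

df = pd.read_csv("admissions.csv")  # hypothetical file

# Nulls: count missing values per column to decide whether to impute, drop, or ignore.
print(df.isnull().sum())

# Type verification: confirm each column's dtype matches expectations.
print(df.dtypes)

# Format consistency: coerce a date column and surface values that fail to parse.
parsed = pd.to_datetime(df["discharge_date"], errors="coerce")
print(df[parsed.isnull() & df["discharge_date"].notnull()])
```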
Establish a clear process for handling missing or null values, balancing statistical methods with business logic and stakeholder input:
Suppose you have a dataset containing daily sales figures for a retail store, and you're using this data to forecast future sales. You notice that sales data are missing (null values) on some days. Standard imputation approaches might suggest filling these missing values with the mean, median, or mode of the sales data. However, this approach may not always be appropriate because it doesn't consider the context or reason for the missing data.
For example, if sales data are missing for specific days because the store was closed for renovation or due to a holiday, using the average of surrounding days would not accurately reflect the store’s sales pattern. Imputing with a standard method could lead to misleading analysis, affecting the accuracy of the sales forecast.
A business stakeholder, such as the store manager, might inform you that the store was closed for a week for renovation, and therefore, it would not be accurate to impute sales data for that period based on historical averages. Instead, the stakeholders might suggest setting sales to zero for those days or using data from the same period in previous years to estimate the impact of the closure on sales.
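A sketch of how that business logic might translate into code, assuming hypothetical column names and closure dates:

```python
import pandas as pd

sales = pd.read_csv("daily_sales.csv", parse_dates=["date"])  # hypothetical file

# Business rule from the store manager: the store was closed for renovation this week,
# so missing sales on those dates should be zero, not a historical average.
closure = sales["date"].between("2024-03-04", "2024-03-10")
sales.loc[closure & sales["revenue"].isnull(), "revenue"] = 0

# Remaining gaps (e.g., one-off recording errors) can fall back to a statistical fill.
sales["revenue"] = sales["revenue"].fillna(sales["revenue"].median())
```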
Conduct basic analysis to validate data quality and understand its characteristics before diving deep into complex models.
To confirm your data is correct, you should be able to run some quick analyses and recreate a few basic reports that validate the underlying numbers. It's advisable to do this; otherwise, you may find yourself deep into the model-building process only to realize you have data quality issues that should have been caught earlier.
Standardization and Normalization: Understand the concepts and their importance in preparing data for analysis.
Standardization is about adjusting the values in your dataset so they have a common scale. It’s like converting temperatures from Fahrenheit to Celsius so everyone understands the temperature similarly. In technical terms, it involves subtracting the mean of your data and dividing by the standard deviation, resulting in data centered around zero with a standard deviation of one.
When it’s important: You standardize data when comparing features on different scales or units. For example, if you look at height and weight together, these two are measured in completely different units (inches and pounds). Standardization makes these different measurements comparable.
Normalization, on the other hand, adjusts your data so that the range is scaled down to fit within a specific range, often between 0 and 1. It’s like adjusting scores from different tests to a common scale to see which scores are better, regardless of the test’s total points.
When it’s important: Normalization is key when your data needs to have a bounded range, especially for methods that rely on the length of the vectors in your data, like k-nearest neighbors (KNN) or neural networks.
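In scikit-learn, the two transformations look like this (toy height and weight values for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

heights_weights = np.array([[68, 160], [72, 200], [64, 120]], dtype=float)

# Standardization: each column is rescaled to mean 0 and standard deviation 1.
standardized = StandardScaler().fit_transform(heights_weights)

# Normalization: each column is rescaled into the 0-1 range.
normalized = MinMaxScaler().fit_transform(heights_weights)
```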
Understanding Data Distribution: Using basic analysis to comprehend data distribution, aiding hypothesis testing and model building.
Before you dive into hypothesis testing or building predictive models, you should first understand the distribution of your data. This step can guide you in selecting the appropriate statistical tests and algorithms that align with your data's characteristics. There are three approaches you can use:
- Visual Inspection: Use plots like histograms, box plots, or Q-Q plots to visually assess the distribution of your data.
- Descriptive Statistics: Look at measures like mean, median, mode, range, variance, and standard deviation to understand your data’s central tendency and spread.
- Statistical Tests: Perform tests like the Shapiro-Wilk test to check if your data is normally distributed, or use skewness and kurtosis measures to understand the degree of asymmetry and tail heaviness in your data distribution.
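Here is a quick sketch of the descriptive and statistical checks, assuming a hypothetical customer-spend column (the Shapiro-Wilk test is intended for modest sample sizes, hence the sampling step):

```python
import pandas as pd
from scipy import stats

spend = pd.read_csv("customer_spend.csv")["amount"]  # hypothetical file and column

# Descriptive statistics: central tendency and spread.
print(spend.describe())

# Skewness and kurtosis: asymmetry and tail heaviness.
print(spend.skew(), spend.kurtosis())

# Shapiro-Wilk test: a small p-value suggests the data is not normally distributed.
stat, p_value = stats.shapiro(spend.sample(min(len(spend), 5000), random_state=0))
print(p_value)
```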
If you want an example of what happens when you don’t do this check, let's take a standard retail company.
The company assumes that customer spending is normally distributed (a bell curve), so it calculates the average spending and bases its offers around this figure. But what if the spending data is not normally distributed but heavily skewed, with a few high spenders (outliers) and many low spenders?
Using the average in this skewed distribution could misrepresent what customers typically spend. As a result, the special offers might be too high for most customers, leading to unnecessary expenditure for the company and not effectively targeting the customer base.
Recognize and address class imbalance to improve model accuracy and prevent biased outcomes.
Imagine you have a fruit basket with apples and oranges, but there are 95 apples and only 5 oranges. If you’re trying to build a machine that sorts the fruit and teaches it using this basket, it might get good at recognizing apples (because it sees them a lot) but not so good at recognizing oranges. In data science, this is a “class imbalance” — you have more of one class (apples) and fewer of another (oranges).
If you don’t address class imbalance, your model might become biased towards the majority class (apples in our example). This means it might often predict the majority class even when the minority class is correct, simply because it hasn’t seen enough of the minority class to learn about it properly.
Use machine learning algorithms inherently designed to handle class imbalance or adjust the existing algorithms to give more weight to the minority class.
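One widely available option is scikit-learn's class_weight setting, shown here on synthetic imbalanced data (resampling approaches such as SMOTE from the imbalanced-learn package are another common route):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data: roughly 95% majority class, 5% minority class.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# class_weight="balanced" re-weights training so errors on the rare class count more.
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```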
Ensure data consistency across related fields, like verifying chronological order in date fields.
Consider a situation where you’re filling out a form with start and end dates for an event. Cross-field validation would involve ensuring the start date is before the end date. It’s not just about checking that each date is valid; it’s about ensuring they make sense together.
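A simple pandas version of that cross-field check might look like this (toy dates for illustration):

```python
import pandas as pd

events = pd.DataFrame({
    "start_date": pd.to_datetime(["2024-05-01", "2024-06-10"]),
    "end_date": pd.to_datetime(["2024-05-03", "2024-06-01"]),  # second row is invalid
})

# Cross-field check: every end date should fall on or after its start date.
invalid = events[events["end_date"] < events["start_date"]]
print(invalid)
```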
Feature Engineering: Learn the basics of creating features that significantly influence model performance.
Feature engineering uses domain knowledge to select, modify, or create new features from raw data to increase the predictive power of machine learning models. Using the hospital’s re-admissions use case, let's take a couple of examples of feature engineering.
Convert continuous data, such as length of stay, into categories (short, medium, long) if the model performs better with categorical data. Or take the number of treatments per day and group it into low-, medium-, and high-intensity treatment to see if higher treatment intensity correlates with readmissions.
What you are seeing here is that treatment intensity or length of stay may have useful predictive power, but in their original form of 1 to 10 days or treatments, they are not as conducive to prediction as short, medium, or long stay.
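A minimal sketch of that binning step with pandas (the cut points are purely illustrative and should be set with clinical input):

```python
import pandas as pd

stays = pd.Series([1, 2, 3, 5, 8, 12, 20])  # length of stay in days (toy data)

# Bucket the continuous variable into categories the model may find easier to use.
stay_category = pd.cut(
    stays, bins=[0, 3, 7, float("inf")], labels=["short", "medium", "long"]
)
print(stay_category.value_counts())
```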
There are some general rules of thumb regarding the amount of data required for effective model training and validation.
- More complex models (like deep neural networks) usually need more data to learn effectively without overfitting (memorizing the data too closely).
- If your model is like a sophisticated machine with many parts (features and parameters), it needs a lot of data to train each part properly.
- The more varied or diverse your data, the more samples you will need to capture all the possible variations.
- If you’re studying something with many different behaviors or characteristics (like predicting customer behavior across different regions), you need enough data to cover all these variations.
- The level of accuracy or precision you need can dictate how much data is required. Higher accuracy typically requires more data to validate the model’s predictions.
- If your model needs to make precise predictions (like in medical diagnoses), you’ll need a large amount of data to ensure the predictions are reliable and validated.
- You need enough data to train the model and validate and test it. This usually means dividing your data into sets for training, validation, and testing.
- Ensuring you have enough data to validate and test the model means setting aside some of your total data, so you must collect enough to accommodate this split.
Overall, a good approach is to begin with a smaller dataset to see how the model performs, then incrementally add more data until performance plateaus or meets your objectives.
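A common way to set up the train/validation/test split in scikit-learn, using a synthetic stand-in for your analytical dataset, is roughly:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in for your analytical dataset (features X, target y such as 30-day readmission).
X, y = make_classification(n_samples=1000, random_state=0)

# Hold back a test set first, then split the remainder into train and validation.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0
)
# Result: roughly 60% train, 20% validation, 20% test.
```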
Know the frequency of data updates, data collection methods, and data management processes.
You would be surprised at how often people build highly useful models only to find the data they need to refresh the model won’t be available at the cadence necessary, thus making the model less useful than intended.
Follow protocols set by your compliance, legal, and data governance team to avoid risks.
Unless you are working at a startup, your IT, Legal, and Compliance teams should be able to identify sensitive data and how to handle it. Follow the rules and don’t skip steps, no matter how cumbersome they may seem.
Understand the process of creating data pipelines to automate data flow from source to destination.
Creating a data pipeline is like setting up a conveyor belt in a factory to move products from one place to another automatically. In the context of data, this “conveyor belt” is a series of steps that automatically move and process data from its source (where it comes from) to its destination (where it’s needed).
Here’s how it works in simple terms:
Collection: First, you gather data from various sources. This could be like picking up raw materials from different warehouse parts. These sources might be databases, websites, sensors, or other places where data is generated.
Transportation: Next, you move the data from where it was collected to where it can be processed. This is like the conveyor belt in our factory, carrying products to different stations. In data terms, this often involves transferring data over a network to a central location.
Processing: Once the data arrives at its processing location, it’s cleaned and transformed into a useful format. This is similar to factory workers refining raw materials or assembling parts into a product. Processing data can mean organizing it, fixing errors, or converting it into a format that’s easy for computers to handle.
Storage: After the data is processed, it’s stored where it can be easily accessed later, like storing finished products in a warehouse. In the data world, this storage can be a database or a data lake, depending on how the data will be used.
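Stripped to its essentials, a pipeline is just those steps chained together; here is a toy sketch using a hypothetical CSV source and a local SQLite database as the storage layer (production pipelines would typically use an orchestration tool such as Apache Airflow):

```python
import sqlite3
import pandas as pd

# Collection: pull raw data from a source (a CSV export here; could be a database or API).
raw = pd.read_csv("raw_admissions.csv")  # hypothetical source file

# Processing: clean and transform it into an analysis-ready shape.
raw["admit_date"] = pd.to_datetime(raw["admit_date"], errors="coerce")
clean = raw.dropna(subset=["patient_id", "admit_date"])

# Storage: land the processed table where analysts and models can reach it.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("admissions", conn, if_exists="replace", index=False)
```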
Here are a few rules of thumb for deciding whether to start with flat files for your MVP and when to transition to more complex data pipelines:
Start with Flat Files If:
- Your data volume is low, and the processing needs are simple.
- You need to quickly validate your idea or model with minimal setup.
Switch to Data Pipelines When:
- Your data volume and complexity grow, making flat files cumbersome to manage.
- You need real-time processing, automation, or integration with other systems.
Consider Scalability Early:
- Even if starting with flat files, plan for future scaling to ensure a smooth transition to data pipelines when needed.
APIs for Data Acquisition: Learn how to set up APIs to pull data from different sources, facilitating real-time or periodic data updates.
APIs (Application Programming Interfaces) are used for data acquisition by providing a structured way to request and receive data from different sources. They act as a bridge between your application and external data sources, allowing for real-time or periodic data updates. Here’s how it works and how you can set up APIs for data acquisition:
You send a request to the API of a data source (like a weather service or social media platform), and in return, the API sends back the data you requested. APIs can provide real-time data, allowing your application to have up-to-date information whenever needed. APIs can be set up to automatically pull data at regular intervals, facilitating periodic updates without manual intervention.
Determine which external services or platforms have the data you need and whether they offer an API. You might need to register or request access to use the API. This often involves creating an account with the data provider and getting an API key, a unique identifier for authenticating your requests.
Use programming languages like Python and Java, or tools designed for API integration, to write scripts or applications that make requests to the API. Libraries like requests in Python can simplify this process. For periodic data updates, you can schedule your API requests using task schedulers (like cron jobs in Unix/Linux) or workflow automation tools (like Apache Airflow).
Once you receive the data, process it as needed (including parsing, cleaning, and transforming) and store it in your database or data warehouse for further use.
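Putting those pieces together, a basic pull with Python's requests library might look like this (the endpoint, parameters, and key below are placeholders, not a real API):

```python
import requests

# Hypothetical endpoint and API key -- substitute the provider's documented URL and auth.
url = "https://api.example.com/v1/weather"
params = {"city": "Boston", "units": "imperial"}
headers = {"Authorization": "Bearer YOUR_API_KEY"}

response = requests.get(url, params=params, headers=headers, timeout=30)
response.raise_for_status()   # fail loudly on HTTP errors
data = response.json()        # parse the JSON payload

# From here, clean and transform the payload and load it into your database or warehouse.
```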
Building an analytical dataset is an iterative process; the dataset may change as each version of the model produces results.
Building an analytical dataset for a project like predicting patient readmissions in a hospital is indeed an iterative process. This means the dataset evolves as you refine your model and learn more about the data and what influences readmissions. Here’s how this iterative process can unfold in the patient readmissions use case:
Start with Basic Data: Initially, you might gather basic patient information, such as demographic details, medical history, and details of the current and past hospital stays.
After testing the initial model, you might find that certain factors, like the length of stay or specific diagnoses, are strong predictors of readmissions. Based on these insights, you add more detailed data to your dataset, such as laboratory test results, medication records, or notes from healthcare providers.
With the additional data, you can create more complex features, such as trends in lab results or the number of readmissions in the past year. You then build and test a second version of the model using the enriched dataset, which may show improved accuracy in predicting readmissions.
The process continues iteratively, with each cycle of model building, testing, and evaluation providing insights that guide further data collection, feature engineering, and model refinement. Feedback from clinicians and hospital administrators might also inform the dataset’s evolution, highlighting other factors that could influence readmissions, such as patient satisfaction or post-discharge support.