In Python, you can split data into training and testing sets using various libraries, with scikit-learn
being a popular choice. Here’s how you can do it:
- Using scikit-learn:

```python
from sklearn.model_selection import train_test_split

# Assuming your data is in X (features) and y (labels) format
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Here, test_size is the proportion of the dataset to include in the test split,
# and random_state ensures reproducibility of the split.
```
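For classification tasks, train_test_split also accepts a stratify argument that preserves the class proportions in both splits. A minimal sketch with synthetic, illustrative data (the arrays and class ratio below are made up for demonstration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 samples, 3 features, imbalanced binary labels (80/20)
X = np.arange(300).reshape(100, 3)
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 class ratio in both the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
print(np.bincount(y_test))          # [16  4] — class ratio preserved
```

Without stratify, a small test set can end up with a skewed class balance purely by chance.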
- Using numpy:

```python
import numpy as np

# Assuming your data is in X (features) and y (labels) format
indices = np.arange(X.shape[0])
np.random.shuffle(indices)  # shuffles in place; call np.random.seed(...) first for a reproducible split

split_ratio = 0.8  # proportion of the data used for training; adjust as needed
split_idx = int(len(indices) * split_ratio)

train_indices, test_indices = indices[:split_idx], indices[split_idx:]
X_train, X_test = X[train_indices], X[test_indices]
y_train, y_test = y[train_indices], y[test_indices]
```
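A runnable version of the numpy approach, using the newer np.random.default_rng generator so the split is reproducible without touching global state. The arrays here are small illustrative placeholders:

```python
import numpy as np

# Illustrative data: 10 samples, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

rng = np.random.default_rng(42)       # seeded generator -> reproducible shuffle
indices = rng.permutation(X.shape[0])

split_idx = int(len(indices) * 0.8)   # 80/20 split
train_idx, test_idx = indices[:split_idx], indices[split_idx:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Because permutation returns a shuffled copy rather than shuffling in place, the original index array is never mutated.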
- Using pandas (with scikit-learn):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assuming your data is in a DataFrame df with the label in 'target_column'
X = df.drop('target_column', axis=1)
y = df['target_column']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
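If you want to stay entirely within pandas, DataFrame.sample can do the split without scikit-learn. A sketch with a made-up DataFrame (the 'feature' and 'target' column names are placeholders):

```python
import pandas as pd

# Illustrative DataFrame; 'feature' and 'target' are placeholder column names
df = pd.DataFrame({'feature': range(10), 'target': [0, 1] * 5})

train_df = df.sample(frac=0.8, random_state=42)  # random 80% of rows for training
test_df = df.drop(train_df.index)                # the remaining 20% for testing

X_train, y_train = train_df.drop('target', axis=1), train_df['target']
X_test, y_test = test_df.drop('target', axis=1), test_df['target']

print(len(train_df), len(test_df))  # 8 2
```

Dropping the sampled index guarantees that no row appears in both splits.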
Remember to replace X, y, and target_column with your actual data and target column name.
Choose the approach that fits the libraries you are already using. Each one splits your data into training and testing sets, which is essential for developing and evaluating machine learning models.