1. Introduction

This report analyzes synthetic trade data to model time-dependent patterns, volatility, and profitability in a controlled setting. The primary objective is to explore the dynamics of simulated trade activity: uncovering recurrent trends, assessing risk through volatility modeling, and evaluating the potential for profit or loss. In doing so, we seek to understand how trading strategies might perform under specific conditions, even if those conditions are artificially constrained. Although rooted in synthetic data, the analysis aims to lay a foundation for broader applications or further investigation of real-world financial scenarios.

The data comprises simulated trades, each with a timestamp, a price between 0 and 1000, a quantity, and a trade type (Buy or Sell). This synthetic dataset provides a structured environment for testing hypotheses and modeling behavior without the unpredictable noise of actual markets. For instance, the timestamps allow us to track the sequential nature of trades, while the prices and quantities offer a basis for calculating profitability and volatility. However, the data is generated rather than drawn from real financial exchanges, which shapes both its strengths and its limitations.

Two limitations stand out. First, prices are artificially bounded between 0 and 1000, a constraint that simplifies analysis but deviates from the unbounded fluctuations seen in real markets. Second, the trade activity lacks real-world seasonality, that is, the cyclical patterns driven by economic events, holidays, or market hours that typically influence trading. As a result, while the analysis can reveal time-dependent trends and volatility characteristics, its findings may not fully mirror real-world outcomes.

Nevertheless, the controlled setting enables a focused exploration of trade dynamics, free from external variables that might obscure underlying patterns. The following sections outline the methods employed, present the findings, and discuss their implications.

## 1. Average time between trades: 0.11 minutes

## 2. Number of unique days with trades: 366

## 3. Average number of trades per day: 13661.2

## 4. Average number of trades per week: 94339.62

## 5. Average number of trades per month: 416666.7

## 6. Aggregating to 30 -minute bins yields 17568 data points with a total of 5e+06 trades.

## 
## 7. Buy/Sell Price Statistics:

##    Average Buy Price: 500

##    Lowest Buy Price: 0

##    Highest Buy Price: 1000

##    Average Sell Price: 500

##    Lowest Sell Price: 0

##    Highest Sell Price: 1000

## 
## 8. Profit/Loss Calculation:

##    Total Profit/Loss: 301181211

##    The trades resulted in a net profit.

##    Average Trade Duration: 167.44 minutes

2. Trade Activity Analysis

The synthetic trade data, despite its limitations, provides rich ground for analysis. Trades arrive on average every 0.11 minutes, with activity spanning 366 unique days. This corresponds to an average of 13,661.2 trades per day, 94,339.62 per week, and 416,666.7 per month, indicating a high trading volume. Aggregating the data into 30-minute bins yields 17,568 data points representing a total of 5 million trades. Buy prices average 500, with a minimum of 0 and a maximum of 1000; sell prices mirror this exactly. The total profit/loss stands at 301,181,211, a net profit. The average trade duration is 167.44 minutes, offering some insight into how the trading strategy behaves in this simulated market.
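The summary figures above can be cross-checked with simple arithmetic. The following is a minimal sketch in base R, not the report's own code; the divisors of 53 weeks and 12 months are assumptions inferred from the reported averages, and the mean gap is approximated from the total time span rather than from actual timestamps.

```r
# Reported totals from the summary output.
n_trades <- 5e6
n_days   <- 366

# Number of 30-minute bins in one day.
bins_per_day <- 24 * 60 / 30

trades_per_day   <- n_trades / n_days
trades_per_week  <- n_trades / 53   # assumption: 53 calendar weeks touched
trades_per_month <- n_trades / 12   # assumption: 12 calendar months
n_bins           <- n_days * bins_per_day

# Approximate mean gap between consecutive trades, in minutes.
mean_gap_minutes <- n_days * 24 * 60 / n_trades

cat("Trades per day:  ", round(trades_per_day, 1), "\n")
cat("Trades per week: ", round(trades_per_week, 2), "\n")
cat("Trades per month:", round(trades_per_month, 1), "\n")
cat("30-minute bins:  ", n_bins, "\n")
cat("Mean gap (min):  ", round(mean_gap_minutes, 2), "\n")
```

The results agree with the reported values, which suggests the weekly and monthly averages were obtained by dividing the total trade count by the number of distinct weeks and months.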


## 
## Risk Metrics (VaR/CVaR):

##    Value at Risk (VaR) at 95% confidence: -0.0121

##    Conditional Value at Risk (CVaR) at 95% confidence: -0.0149


3. VaR/CVaR

Value at Risk (VaR) at 95% confidence: -0.0121 (1.21%). There is a 5% chance that the daily loss will exceed 1.21% of the investment, which provides a threshold for downside risk in this simulated market.

Conditional Value at Risk (CVaR) at 95% confidence: -0.0149 (1.49%). In the worst 5% of cases, the average loss is expected to be 1.49%, which describes extreme loss scenarios beyond the VaR threshold.

The accompanying plot is a histogram of daily returns with 50 bins, filled in blue at 70% opacity, with dashed vertical lines for VaR (red) and CVaR (green). It visualizes the distribution of daily returns, with the VaR (-0.0121) marking the 5th percentile of returns and the CVaR (-0.0149) indicating the average return in the worst 5% of cases. The plot highlights the left tail of potential losses.
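The two risk measures described above can be sketched with the historical (empirical quantile) method. This is an illustration under stated assumptions: the return series `daily_returns` is simulated here because the report's actual series is not shown, and only the names `var_95` and `cvar_95` come from the report.

```r
set.seed(42)

# Hypothetical daily returns; a stand-in for the report's return series.
daily_returns <- rnorm(366, mean = 0, sd = 0.0075)

# Historical VaR at 95% confidence: the 5th percentile of daily returns.
var_95 <- unname(quantile(daily_returns, probs = 0.05))

# Historical CVaR: the mean of the returns at or below the VaR threshold.
cvar_95 <- mean(daily_returns[daily_returns <= var_95])

cat("VaR (95%): ", round(var_95, 4), "\n")
cat("CVaR (95%):", round(cvar_95, 4), "\n")
```

By construction the CVaR is never above the VaR, since it averages only the observations in the tail beyond the threshold.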

## 
## Clustering Trades:

## Cluster Summary:

## # A tibble: 5 × 5
##   Cluster mean_price std_price mean_quantity sum_quantity
##     <int>      <dbl>     <dbl>         <dbl>        <dbl>
## 1       1       500.      289.         1002.   1003738491
## 2       2       500.      289.         5003.   4998631822
## 3       3       500.      289.         3003.   3007781594
## 4       4       500.      289.         9003.   8978152756
## 5       5       500.      289.         7004.   7002647328

4. Clustering

This scatter plot groups trades into five clusters based on price and quantity using k-means clustering. Each cluster is color-coded to distinguish different trade behaviors visually. Well-separated clusters would suggest distinct trading strategies or market conditions influencing trade execution, whereas heavily overlapping clusters would indicate homogeneous trades and limit the usefulness of clustering for prediction. In the cluster summary above, all five clusters share the same mean price (about 500) and price standard deviation (about 289) and differ only in mean quantity (roughly 1,000, 3,000, 5,000, 7,000, and 9,000), so the separation runs along the quantity axis alone.
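A minimal sketch of this kind of clustering, using base R's `kmeans()`, is shown below. The data frame `trades` and its uniform price and quantity ranges are assumptions made for illustration; the report does not show how the features were prepared or whether they were scaled. With unscaled features, the wider quantity range dominates the distance calculation, which is one plausible reason for clusters that differ in quantity but not in price.

```r
set.seed(2024)

# Hypothetical trades: prices in [0, 1000], quantities in [1, 10000].
n <- 5000
trades <- data.frame(
  Price    = runif(n, 0, 1000),
  Quantity = runif(n, 1, 10000)
)

# k-means with five centers; several random starts reduce the risk
# of ending in a poor local optimum.
km <- kmeans(trades[, c("Price", "Quantity")],
             centers = 5, nstart = 10, iter.max = 50)
trades$Cluster <- km$cluster

# Per-cluster means, ordered by mean quantity.
cluster_summary <- aggregate(cbind(Price, Quantity) ~ Cluster,
                             data = trades, FUN = mean)
cluster_summary <- cluster_summary[order(cluster_summary$Quantity), ]
print(cluster_summary)
```

Standardizing both columns (for example with `scale()`) before clustering would give price and quantity equal weight.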

## 
## 11. Price Volatility:

##    Price Volatility (Std Dev): 288.71

##    Daily Price Range:

##       Date              min_price          max_price          range       
##  Min.   :2024-01-01   Min.   :0.000442   Min.   : 999.4   Min.   : 999.4  
##  1st Qu.:2024-04-01   1st Qu.:0.021511   1st Qu.: 999.9   1st Qu.: 999.8  
##  Median :2024-07-01   Median :0.049344   Median :1000.0   Median : 999.9  
##  Mean   :2024-07-01   Mean   :0.073154   Mean   : 999.9   Mean   : 999.9  
##  3rd Qu.:2024-09-30   3rd Qu.:0.106503   3rd Qu.:1000.0   3rd Qu.: 999.9  
##  Max.   :2024-12-31   Max.   :0.500655   Max.   :1000.0   Max.   :1000.0

## 
## 12. Time-Based Patterns:

##    Hourly Trade Activity:

## # A tibble: 24 × 2
##     Hour  count
##    <int>  <int>
##  1     0 207322
##  2     1 207740
##  3     2 208567
##  4     3 207851
##  5     4 210049
##  6     5 208976
##  7     6 208484
##  8     7 207091
##  9     8 208499
## 10     9 208170
## # ℹ 14 more rows

##    Daily Trade Activity:

## # A tibble: 7 × 2
##   DayOfWeek  count
##   <ord>      <int>
## 1 Sun       711528
## 2 Mon       724915
## 3 Tue       721194
## 4 Wed       710273
## 5 Thu       709544
## 6 Fri       711532
## 7 Sat       711014

## 
## 13. Outlier Detection:

##    Number of Price Outliers: 0

##    Price Outliers:

## [1] Timestamp Price     Quantity 
## <0 rows> (or 0-length row.names)

5. Daily Price

This time series plot shows the daily average price over the observation period. The x-axis represents the timeline, while the y-axis shows the average price recorded each day. The smoothness or volatility of this line helps identify market stability or turbulence. Large fluctuations may indicate speculative trading or market inefficiencies, whereas a relatively stable price trend suggests an equilibrium state. Abrupt changes in price could signal external influences such as macroeconomic factors or simulated market interventions. In this dataset the daily averages stay close to 500: the ARIMA fit reported below has a residual variance of 6.395, a standard deviation of about 2.5.


## 
## Seasonality Decomposition:

## 
## STL Decomposition (Robust):


6. Seasonality Decomposition Plot

The decomposition plot breaks down the time series of daily prices into three components: trend, seasonality, and residual noise. The trend component captures the long-term movement in prices, revealing whether the market is experiencing growth or decline. The seasonal component isolates repetitive patterns within the data, helping to determine whether certain days, weeks, or months exhibit consistent price behaviors. Lastly, the residual component accounts for unexplained fluctuations, indicating potential anomalies or unpredictable volatility. This decomposition is critical for constructing models that separate underlying market behaviors from short-term noise.
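The decomposition described above can be sketched with base R's `stl()`. The series below is simulated noise around 500, a hypothetical stand-in for the report's daily prices. The frequency of 30 mirrors the "Frequency = 30" shown in the forecast output, and `robust = TRUE` follows the "STL Decomposition (Robust)" label; the remaining arguments are assumptions.

```r
set.seed(7)

# Hypothetical daily average prices with no built-in seasonality.
daily_prices <- rnorm(366, mean = 500, sd = 2.5)
daily_prices_ts <- ts(daily_prices, frequency = 30)

# Robust STL decomposition into seasonal, trend, and remainder components.
fit <- stl(daily_prices_ts, s.window = "periodic", robust = TRUE)
components <- fit$time.series

# The three components add back up to the original series.
recomposed <- rowSums(components)

# Spread of each component; for pure noise the remainder dominates.
print(round(apply(components, 2, sd), 3))
```

When the seasonal and trend components are small relative to the remainder, as expected for data without real seasonality, the decomposition offers little structure for a forecasting model to use.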

## 
## ARIMA Model for Time Series Forecasting:

## 
##  Fitting models using approximations to speed things up...
## 
##  ARIMA(2,0,2)            with non-zero mean : 1725.567
##  ARIMA(0,0,0)            with non-zero mean : 1720.805
##  ARIMA(1,0,0)            with non-zero mean : 1723.36
##  ARIMA(0,0,1)            with non-zero mean : 1722.722
##  ARIMA(0,0,0)            with zero mean     : 5589.775
##  ARIMA(1,0,1)            with non-zero mean : 1721.98
## 
##  Now re-fitting the best model(s) without approximations...
## 
##  ARIMA(0,0,0)            with non-zero mean : 1720.805
## 
##  Best model: ARIMA(0,0,0)            with non-zero mean

## Auto ARIMA Summary:

## Series: daily_prices_ts 
## ARIMA(0,0,0) with non-zero mean 
## 
## Coefficients:
##           mean
##       499.9992
## s.e.    0.1320
## 
## sigma^2 = 6.395:  log likelihood = -858.39
## AIC=1720.77   AICc=1720.8   BIC=1728.58
## 
## Training set error measures:
##                        ME     RMSE      MAE          MPE      MAPE      MASE
## Training set 3.787996e-13 2.525346 1.993829 -0.002549461 0.3987169 0.6891452
##                     ACF1
## Training set -0.01810237

## Detailed ARIMA Fit Summary:

## 
## Call:
## arima(x = daily_prices_ts, order = best_order)
## 
## Coefficients:
##       intercept
##        499.9992
## s.e.     0.1320
## 
## sigma^2 estimated as 6.377:  log likelihood = -858.39,  aic = 1720.77
## 
## Training set error measures:
##                        ME     RMSE      MAE          MPE      MAPE      MASE
## Training set 3.787996e-13 2.525346 1.993829 -0.002549461 0.3987169 0.6932019
##                     ACF1
## Training set -0.01810237

## Warning in adf.test(model$residuals): p-value smaller than printed p-value

## Warning in kpss.test(model$residuals): p-value greater than printed p-value

## 
## ADF Test for Stationarity:

##    ADF Statistic: -6.5512 , p-value: 0.01

## KPSS Test for Stationarity:

##    KPSS Statistic: 0.0654 , p-value: 0.1

## 
## Forecasted Prices for the Next 30 Days:

## Time Series:
## Start = c(13, 7) 
## End = c(14, 6) 
## Frequency = 30 
##  [1] 499.9992 499.9992 499.9992 499.9992 499.9992 499.9992 499.9992 499.9992
##  [9] 499.9992 499.9992 499.9992 499.9992 499.9992 499.9992 499.9992 499.9992
## [17] 499.9992 499.9992 499.9992 499.9992 499.9992 499.9992 499.9992 499.9992
## [25] 499.9992 499.9992 499.9992 499.9992 499.9992 499.9992

7. ARIMA Model Forecast Plot

This plot shows the 30-day forecast generated by the ARIMA model, with a shaded interval indicating the uncertainty in the predictions. The x-axis corresponds to future time periods and the y-axis to predicted price values. A constant forecast line implies that the model has detected no trend or autocorrelation in the data, as expected for a stationary series. That is the case here: the selected model is ARIMA(0,0,0) with non-zero mean, so every forecast equals the estimated mean of 499.9992. An upward or downward projection would instead indicate an expected direction of price movement. A narrow interval suggests higher predictive confidence, whereas a wider one indicates more uncertainty.

The autocorrelation function (ACF) and partial autocorrelation function (PACF) plots help assess the residuals from the ARIMA model. The ACF shows the correlation between the residuals and their own lagged values, revealing any remaining lag dependence. Significant spikes at some lags would suggest an under-fitted model that needs additional autoregressive or moving-average terms. The PACF isolates the direct relationship at each lag after accounting for shorter lags, which helps refine model selection. Correlations that stay within the significance bounds at all lags indicate weak dependence and support the choice of a simple ARIMA model. The stationarity tests reported above point the same way: the ADF test rejects a unit root (statistic -6.5512, p-value below 0.01), and the KPSS test does not reject stationarity (statistic 0.0654, p-value above 0.1).
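To illustrate why an ARIMA(0,0,0) model with a non-zero mean produces a flat forecast, the sketch below fits that model with base R's `arima()` and forecasts 30 steps with `predict()`. The input series is simulated and hypothetical; the report itself selected the order with an automatic search, which is not reproduced here.

```r
set.seed(123)

# Hypothetical daily average prices: white noise around 500.
daily_prices_ts <- ts(rnorm(366, mean = 500, sd = 2.5), frequency = 30)

# ARIMA(0,0,0) with a mean term: a constant plus white noise.
fit <- arima(daily_prices_ts, order = c(0, 0, 0))

# 30-step forecast with approximate 95% intervals.
fc <- predict(fit, n.ahead = 30)
point_forecast <- as.numeric(fc$pred)
lower <- point_forecast - 1.96 * as.numeric(fc$se)
upper <- point_forecast + 1.96 * as.numeric(fc$se)

# Ljung-Box test on the residuals: a large p-value is consistent
# with no remaining autocorrelation.
lb <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box")

cat("Estimated mean:", round(coef(fit)[["intercept"]], 4), "\n")
cat("Forecast range:", round(range(point_forecast), 4), "\n")
cat("Ljung-Box p-value:", round(lb$p.value, 3), "\n")
```

Because the model has no autoregressive or moving-average terms, every forecast equals the estimated mean and the interval width is the same at every horizon.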

## 
## EGARCH Model for Volatility Modeling:

## EGARCH fit converged successfully with default solver.
##             Estimate   Std. Error    t value     Pr(>|t|)
## mu      0.0001343449 4.240529e-04  0.3168116 7.513866e-01
## omega  -7.7644951993 1.865139e+00 -4.1629582 3.141509e-05
## alpha1  0.0686746021 1.176890e-01  0.5835260 5.595393e-01
## beta1   0.2192437208 1.873413e-01  1.1702905 2.418841e-01
## gamma1  0.6441725632 1.351638e-01  4.7658670 1.880433e-06
## skew    1.0517108598 1.070898e-01  9.8208345 0.000000e+00
## shape  33.9484334794 4.062575e+01  0.8356384 4.033584e-01

## 
## Forecasted Volatility for the Next 30 Days:

##      2024-12-31 23:59:54
## T+1          0.008669072
## T+2          0.007275699
## T+3          0.007001494
## T+4          0.006942772
## T+5          0.006929963
## T+6          0.006927158
## T+7          0.006926543
## T+8          0.006926409
## T+9          0.006926379
## T+10         0.006926373
## T+11         0.006926371
## T+12         0.006926371
## T+13         0.006926371
## T+14         0.006926371
## T+15         0.006926371
## T+16         0.006926371
## T+17         0.006926371
## T+18         0.006926371
## T+19         0.006926371
## T+20         0.006926371
## T+21         0.006926371
## T+22         0.006926371
## T+23         0.006926371
## T+24         0.006926371
## T+25         0.006926371
## T+26         0.006926371
## T+27         0.006926371
## T+28         0.006926371
## T+29         0.006926371
## T+30         0.006926371

8. Volatility Forecast Using EGARCH

This plot visualizes the predicted volatility for the next 30 days based on the EGARCH model. The x-axis represents time and the y-axis the expected volatility. A declining path indicates reduced market uncertainty, whereas a rising path suggests increasing risk. The EGARCH model allows for asymmetry in volatility, meaning that negative price shocks can have a larger effect than positive ones of the same size. A forecast that flattens out implies that fluctuations are expected to settle at a stable long-run level. Here the forecast falls from 0.00867 at T+1 to about 0.00693 by T+10 and stays there through T+30.
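The shape of this forecast can be reproduced from the reported coefficients. The sketch below assumes the standard multi-step recursion for an EGARCH(1,1) model, in which the expected shock terms are zero beyond one step, so the expected log variance follows log σ²(h) = ω + β₁ · log σ²(h−1). The starting value is the reported T+1 forecast; how the report produced that first step is not shown.

```r
# Coefficients and first forecast taken from the EGARCH output above.
omega   <- -7.7644951993
beta1   <- 0.2192437208
sigma_1 <- 0.008669072   # reported T+1 volatility forecast

horizon <- 30
log_var <- numeric(horizon)
log_var[1] <- 2 * log(sigma_1)

# Beyond one step, the expected shock terms drop out of the recursion.
for (h in 2:horizon) {
  log_var[h] <- omega + beta1 * log_var[h - 1]
}
sigma_path <- exp(log_var / 2)

# Long-run level implied by the fixed point of the recursion.
long_run_sigma <- exp(omega / (1 - beta1) / 2)

cat("T+2 volatility:     ", signif(sigma_path[2], 7), "\n")
cat("T+30 volatility:    ", signif(sigma_path[horizon], 7), "\n")
cat("Long-run volatility:", signif(long_run_sigma, 7), "\n")
```

Because β₁ is small (about 0.22), the gap to the long-run level shrinks by roughly a factor of five at each step, which is why the reported forecasts are flat after about ten days.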

## 
## HAR-RV Model Results:

## 
## Call:
## lm(formula = RV ~ RV_1 + RV_5 + RV_22, data = har_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1993.91  -511.47    -7.25   546.59  1971.25 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.866e+04  1.162e+04   5.049  7.8e-07 ***
## RV_1         4.004e-02  6.405e-02   0.625   0.5323    
## RV_5        -2.927e-02  1.543e-01  -0.190   0.8497    
## RV_22       -1.149e+00  4.526e-01  -2.538   0.0117 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 750 on 294 degrees of freedom
## Multiple R-squared:  0.02492,    Adjusted R-squared:  0.01497 
## F-statistic: 2.504 on 3 and 294 DF,  p-value: 0.05935

## 
## Volatility Model Comparison:

## EGARCH: AIC= NA , BIC= NA

## FIGARCH: AIC= NA , BIC= NA

## HAR-RV: AIC= 4797.2 , BIC= 4815.7

9. Price Distribution Histogram

This histogram shows the distribution of trade prices across all transactions. The x-axis represents price bins, while the y-axis indicates the number of trades executed in each bin. A distribution concentrated around a central value (e.g., 500) is consistent with mean-reverting behavior, although a histogram alone cannot establish it. A skewed or heavy-tailed distribution would instead imply price asymmetry or speculative trading behavior. The histogram is useful for judging whether prices follow a normal distribution or show abnormal deviations. Here the overall price standard deviation of 288.71 is almost exactly that of a uniform distribution on [0, 1000], so a roughly flat histogram is expected.
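As a quick consistency check, the reported price standard deviation can be compared with the theoretical value for a uniform distribution. The sketch below assumes that prices are independent and uniformly distributed on [0, 1000]; the report does not state how prices were generated.

```r
price_min <- 0
price_max <- 1000

# Standard deviation of a uniform distribution on [min, max].
uniform_sd  <- (price_max - price_min) / sqrt(12)
reported_sd <- 288.71

# Under independence, the standard deviation of a daily average over
# about 13,661 trades shrinks by the square root of the trade count.
trades_per_day <- 13661.2
daily_mean_sd  <- reported_sd / sqrt(trades_per_day)

cat("Uniform SD:        ", round(uniform_sd, 3), "\n")
cat("Reported SD:       ", reported_sd, "\n")
cat("SD of daily average:", round(daily_mean_sd, 3), "\n")
```

The implied standard deviation of the daily average, about 2.47, is close to the RMSE of 2.525 reported for the ARIMA fit, which is what one would expect if daily averages are simply means of independent draws.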

10. Forecasted Volatility from the HAR-RV Model

This plot presents the volatility forecast over a set horizon using the HAR-RV model. The x-axis represents the time horizon, while the y-axis displays the expected realized volatility. If the HAR-RV model predicts stable volatility, it suggests that past realized volatility contains limited information about future uncertainty. If the predictions fluctuate, it implies clustering effects, where high-volatility periods are followed by sustained turbulence. In this run the fit is weak: the regression explains about 2.5% of the variation in realized volatility (R-squared 0.0249), the overall F-test has a p-value of 0.059, and only the 22-day term is significant at the 5% level (coefficient -1.149, p = 0.0117).
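The regression behind this forecast, `lm(RV ~ RV_1 + RV_5 + RV_22)`, can be sketched in base R as follows. The realized-volatility series `rv` is simulated and hypothetical, and the helper `lag_mean()` is an illustrative name introduced here; the construction of the lagged averages (means over the previous 1, 5, and 22 days) is the usual HAR convention and is assumed rather than taken from the report.

```r
set.seed(11)

# Hypothetical daily realized volatility series.
rv <- abs(rnorm(300, mean = 1, sd = 0.3))

# Mean of the previous k observations (x[t-1], ..., x[t-k]);
# NA where fewer than k past values exist.
lag_mean <- function(x, k) {
  trailing <- stats::filter(x, rep(1 / k, k), sides = 1)
  c(NA, as.numeric(trailing)[-length(x)])
}

har_data <- data.frame(
  RV    = rv,
  RV_1  = lag_mean(rv, 1),    # previous day
  RV_5  = lag_mean(rv, 5),    # previous week (5 trading days)
  RV_22 = lag_mean(rv, 22)    # previous month (22 trading days)
)
har_data <- na.omit(har_data)

# HAR-RV regression with the same formula as in the report's output.
har_fit <- lm(RV ~ RV_1 + RV_5 + RV_22, data = har_data)
print(round(coef(har_fit), 4))
cat("AIC:", round(AIC(har_fit), 1), " BIC:", round(BIC(har_fit), 1), "\n")
```

The first 22 observations are dropped because the monthly average is undefined for them.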

11. Volatility Model Comparison (AIC/BIC Scores)

This table compares the volatility models (EGARCH, FIGARCH, and HAR-RV) by their Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). A lower AIC or BIC indicates a better trade-off between fit and complexity. If one model consistently has lower values than the others, it suggests a superior ability to capture volatility patterns. If several models yield similar scores, they explain the data about equally well, and model selection becomes a matter of preference or additional robustness checks. In this run only the HAR-RV model produced values (AIC 4797.2, BIC 4815.7); the EGARCH and FIGARCH entries are NA, so the three models cannot be ranked from this output.
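For reference, both criteria are simple functions of the log-likelihood. The sketch below applies the standard definitions, AIC = 2k − 2·logLik and BIC = k·log(n) − 2·logLik, to the ARIMA fit reported earlier. The parameter count k = 2 (mean and innovation variance) and the sample size n = 366 are assumptions inferred from the output.

```r
# Values from the ARIMA(0,0,0) summary above.
log_lik <- -858.39
k <- 2     # assumption: mean and innovation variance
n <- 366   # assumption: one observation per trading day

aic <- 2 * k - 2 * log_lik
bic <- k * log(n) - 2 * log_lik

cat("AIC:", round(aic, 2), "\n")
cat("BIC:", round(bic, 2), "\n")
```

These values agree with the reported AIC of 1720.77 and BIC of 1728.58 up to rounding of the log-likelihood. Criteria are only comparable across models fitted to the same response and sample, which matters when comparing a regression on realized volatility with GARCH-type models fitted to returns.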

Results

The synthetic market dataset revealed a highly active trading environment, with an average of 13,661 trades per day and 416,667 trades per month, and consecutive trades occurring every 0.11 minutes (6.6 seconds) on average. Aggregating these trades into 30-minute intervals produced 17,568 bins, collectively representing 5 million simulated transactions. Within the constrained price bounds (0–1000), individual trade prices were widely dispersed, with a standard deviation of 288.71. Trading activity was profitable overall, generating a net profit of 301,181,211 units. Risk metrics indicated limited downside exposure: Value at Risk (VaR) at 95% confidence stood at -1.21%, while Conditional Value at Risk (CVaR), the average over the worst 5% of scenarios, was -1.49%. Time series modeling selected ARIMA(0,0,0) with non-zero mean, confirming that daily average prices cluster tightly around the synthetic mean of 500. Volatility forecasting via EGARCH produced a significant γ1 of 0.644 and an insignificant α1 of 0.069 (p = 0.56), with forecasts stabilizing at 0.0069 over the 30-day horizon. If the model was fitted with rugarch, whose parameterization assigns the sign effect to α1 and the size effect to γ1, the evidence for asymmetry is weak.

Clustering analysis partitioned trades into five groups differentiated by transaction volume rather than price, as all clusters shared the same mean price (500) and price standard deviation (289). Cluster 4 had the largest volume, averaging 9,003 units per trade (8.98 billion units in total). Temporal patterns showed minimal variability: hourly trade counts ranged narrowly between about 207,000 and 210,000 in the hours shown, and day-of-week counts ranged from about 710,000 to 725,000, reflecting the absence of real-world seasonality. Daily price ranges were artificially stable, with daily minimums averaging 0.073 and daily maximums averaging 999.9, while outlier detection and STL decomposition found no anomalous prices or meaningful trend or seasonal components. These findings underscore the dataset’s utility for controlled strategy testing, since high-frequency activity and stable mean prices allow clear volatility and risk assessments. They also expose its limitations, such as unrealistic price bounds and the exclusion of external market influences.

References

Andersen, T. G., Bollerslev, T., Diebold, F. X., & Labys, P. (2003). Modeling and Forecasting Realized Volatility. Econometrica, 71(2), 579–625. https://doi.org/10.1111/1468-0262.00418 Reference for realized volatility, the quantity modeled by the HAR-RV (Heterogeneous Autoregressive Realized Volatility) model.
Baillie, R. T., Bollerslev, T., & Mikkelsen, H. O. (1996). Fractionally Integrated Generalized Autoregressive Conditional Heteroskedasticity. Journal of Econometrics, 74(1), 3–30. https://doi.org/10.1016/0304-4076(95)01749-6 Reference for FIGARCH (Fractionally Integrated GARCH) model.
Box, G. E. P., Jenkins, G. M., & Reinsel, G. C. (2015). Time Series Analysis: Forecasting and Control (5th ed.). Wiley. Foundational reference for ARIMA modeling.
Cleveland, R. B., Cleveland, W. S., McRae, J. E., & Terpenning, I. (1990). STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, 6(1), 3–73. Reference for STL (Seasonal-Trend Decomposition using LOESS) method.
Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A K-Means Clustering Algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1), 100–108. https://doi.org/10.2307/2346830 Reference for k-means clustering.
Jorion, P. (2006). Value at Risk: The New Benchmark for Managing Financial Risk (3rd ed.). McGraw-Hill. Reference for Value at Risk (VaR) and risk management concepts.
Nelson, D. B. (1991). Conditional Heteroskedasticity in Asset Returns: A New Approach. Econometrica, 59(2), 347–370. https://doi.org/10.2307/2938260 Seminal paper on the EGARCH (Exponential GARCH) model.
Rockafellar, R. T., & Uryasev, S. (2000). Optimization of Conditional Value-at-Risk. Journal of Risk, 2(3), 21–42. https://doi.org/10.21314/JOR.2000.038 Reference for Conditional Value at Risk (CVaR).
Tsay, R. S. (2005). Analysis of Financial Time Series (2nd ed.). Wiley. Comprehensive resource for volatility modeling, GARCH variants, and outlier detection.
Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.). OTexts. https://otexts.com/fpp3/ Practical guide for time series forecasting, including ARIMA and diagnostics (e.g., ADF/KPSS tests).
R Core Team. (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. https://www.R-project.org/ Reference for R packages and functions (e.g., xts, rugarch, forecast) used in the analysis.

Synthetic Market Dynamics: A Predictive Time Series Odyssey

2025-02-21