Statistical Data Analysis
Converting raw particle detector data into energy spectra using Python.
My original report: https://docs.google.com/document/d/1P-tWzdlIKxQNC310kIgAdcKr3lf9ZwuVwtXY2MCfuls/edit?usp=sharing
Introduction
This portfolio gives an example of how Python can be used for statistical analysis. Data acquired from the particle detector is expected to contain noise from background events, so converting the raw data into energy spectra is a crucial step in the analysis. The conversion relies on calibration data taken with a known source.
Terminology
● Calibration data: Data taken with a known source, used to determine the factor that converts raw detector values into energy.
● Max minus min: The difference between the maximum and minimum values of the calibration data, used here as an energy estimator.
● Regression: Loosely speaking, a mathematical method for finding the function that best fits given data.
● Chi-squared (χ²) fit: Commonly described as a 'goodness of fit' test. In regression analysis, it is used to test the hypothesis that the fitted function describes the data.
● μ: Mean; average.
● σ: Standard deviation
● A: Amplitude
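The calibration idea behind the first two terms can be sketched in Python. This is a minimal illustration under assumptions that are not stated in the report: each event is taken to be a short trace of samples, the known source energy (KNOWN_ENERGY_KEV) and the trace values are made up, and the helper names max_minus_min and calibration_factor are hypothetical.

```python
import numpy as np

def max_minus_min(traces):
    """Return max - min for each row (assumed: one row per recorded event)."""
    traces = np.asarray(traces, dtype=float)
    return traces.max(axis=1) - traces.min(axis=1)

def calibration_factor(estimates, known_energy):
    """Factor that maps the mean raw estimate onto the known source energy."""
    return known_energy / np.mean(estimates)

# Hypothetical calibration traces: three events, four samples each (arbitrary units).
calib_traces = np.array([
    [0.0, 2.0, 1.0, 0.0],
    [0.5, 2.5, 1.0, 0.5],
    [1.0, 3.0, 2.0, 1.0],
])
KNOWN_ENERGY_KEV = 10.0  # hypothetical energy of the known source

raw = max_minus_min(calib_traces)                    # raw estimator per event
factor = calibration_factor(raw, KNOWN_ENERGY_KEV)   # energy per raw unit
energies = factor * raw                              # calibrated energies
```

The same factor would then be applied to the estimator values of the signal data to place them on an energy axis.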
Analysis
The uncertainty on each histogram bin can reasonably be estimated as the square root of its count, because the events, including the background, occur randomly and therefore follow counting (Poisson) statistics. Fitting a curve to such a histogram often runs into a common problem: the tails of the distribution agree well with the model, but discrepancies appear at the peak, which inflates chi-squared (χ²) and drives the fit probability toward 0. The main cause is the larger absolute uncertainty near the peak. When the √N uncertainties are passed to a fitting routine (for example, through the sigma argument of scipy.optimize.curve_fit), each residual is divided by its uncertainty, so the high-count bins near the center carry less weight per unit of deviation than the low-count bins in the tails.
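A minimal sketch of this procedure follows. It assumes synthetic, already-calibrated energies drawn from a single Gaussian (the numbers are invented, not the report's data) and uses a hypothetical helper named gaussian. The bin uncertainties are the square roots of the counts, and the fit probability is taken from the χ² survival function.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import chi2

def gaussian(x, A, mu, sigma):
    """Unnormalized Gaussian with amplitude A, mean mu, standard deviation sigma."""
    return A * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Synthetic stand-in for calibrated energies (assumption, not real detector data).
rng = np.random.default_rng(0)
energies = rng.normal(loc=10.0, scale=0.5, size=5000)

counts, edges = np.histogram(energies, bins=40)
centers = 0.5 * (edges[:-1] + edges[1:])

# Drop empty bins: sqrt(0) = 0 would give an infinite weight.
mask = counts > 0
x, y = centers[mask], counts[mask].astype(float)
y_err = np.sqrt(y)  # square-root (Poisson) uncertainty per bin

p0 = [y.max(), np.mean(energies), np.std(energies)]  # initial guess
popt, pcov = curve_fit(gaussian, x, y, p0=p0, sigma=y_err, absolute_sigma=True)

# Chi-squared, degrees of freedom, and fit probability (p-value).
chi2_value = float(np.sum(((y - gaussian(x, *popt)) / y_err) ** 2))
dof = len(x) - len(popt)
fit_probability = float(chi2.sf(chi2_value, dof))
```

Setting absolute_sigma=True tells curve_fit to treat y_err as real uncertainties rather than relative weights, which matters when the parameter errors in pcov are interpreted.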
Two strategies can mitigate this problem: enlarging the bins, or restricting the fit region to the center of the distribution. Larger bins increase the count per bin, which reduces each bin's relative uncertainty (√N/N = 1/√N) and smooths out statistical fluctuations. Restricting the fit region helps because each peak in the histogram is typically a narrow Gaussian, so fitting only the range around the peak keeps the fit focused on the region the Gaussian model is meant to describe and can give a more accurate result.
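Both strategies can be tried with a small helper. The sketch below is an illustration under assumptions: the data are synthetic (a Gaussian peak on a flat background, with invented numbers), the fit window of 9.0 to 11.0 is chosen by hand, and fit_peak is a hypothetical helper, not part of any library.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, A, mu, sigma):
    """Unnormalized Gaussian with amplitude A, mean mu, standard deviation sigma."""
    return A * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def fit_peak(data, bins, fit_range):
    """Histogram data inside fit_range only, then fit a Gaussian with sqrt(N) errors."""
    counts, edges = np.histogram(data, bins=bins, range=fit_range)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mask = counts > 0
    x, y = centers[mask], counts[mask].astype(float)
    y_err = np.sqrt(y)
    p0 = [y.max(), x[np.argmax(y)], (fit_range[1] - fit_range[0]) / 4.0]
    popt, _ = curve_fit(gaussian, x, y, p0=p0, sigma=y_err, absolute_sigma=True)
    chi2_value = np.sum(((y - gaussian(x, *popt)) / y_err) ** 2)
    reduced_chi2 = chi2_value / (len(x) - len(popt))
    mean_rel_err = float(np.mean(y_err / y))  # average of 1/sqrt(N) over the bins
    return popt, reduced_chi2, mean_rel_err

# Synthetic data (assumption): Gaussian peak plus flat background.
rng = np.random.default_rng(1)
data = np.concatenate([
    rng.normal(10.0, 0.5, 4000),
    rng.uniform(6.0, 14.0, 1000),
])

window = (9.0, 11.0)  # hand-picked region around the peak
popt_fine, red_chi2_fine, rel_fine = fit_peak(data, bins=40, fit_range=window)
popt_coarse, red_chi2_coarse, rel_coarse = fit_peak(data, bins=10, fit_range=window)
```

Comparing rel_fine with rel_coarse shows the effect of bin size on the relative uncertainty, while the reduced χ² values indicate how each choice changes the quality of the fit.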
Applying the max minus min energy estimator and its associated calibration factor yields a histogram like the one shown in the left figure. Because this histogram has multiple peaks, its functional form is best described by a sum of several Gaussian distributions:
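In terms of the amplitude A, mean μ, and standard deviation σ defined under Terminology, a sum of n Gaussians is conventionally written as f(x) = Σ_{i=1..n} A_i · exp(−(x − μ_i)² / (2σ_i²)). The sketch below fits that form with scipy.optimize.curve_fit. It is an illustration only: the four peak positions, widths, and event counts are invented, the initial guesses are the kind one would read off a plot by eye, and multi_gaussian is a hypothetical helper.

```python
import numpy as np
from scipy.optimize import curve_fit

def multi_gaussian(x, *params):
    """Sum of Gaussians; params = (A1, mu1, sigma1, A2, mu2, sigma2, ...)."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for A, mu, sigma in zip(params[0::3], params[1::3], params[2::3]):
        total += A * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
    return total

# Synthetic spectrum with four peaks (assumption: positions and widths are made up).
rng = np.random.default_rng(2)
true_means = [5.0, 10.0, 15.0, 20.0]
data = np.concatenate([rng.normal(m, 0.5, 2000) for m in true_means])

counts, edges = np.histogram(data, bins=120, range=(2.0, 23.0))
centers = 0.5 * (edges[:-1] + edges[1:])
mask = counts > 0  # skip empty bins between the peaks
x, y = centers[mask], counts[mask].astype(float)
y_err = np.sqrt(y)

# p0 is required here: it tells curve_fit how many parameters *params holds.
p0 = [200.0, 5.2, 0.6,
      200.0, 9.8, 0.6,
      200.0, 15.2, 0.6,
      200.0, 19.8, 0.6]
popt, pcov = curve_fit(multi_gaussian, x, y, p0=p0, sigma=y_err, absolute_sigma=True)

amplitudes, means, sigmas = popt[0::3], popt[1::3], popt[2::3]
```

Each group of three fitted values corresponds to one peak, so the amplitudes, means, and standard deviations can be read off by slicing popt as shown in the last line.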
With n = 4 peaks, the curve fit gives the following values: