Data Reduction (archival)

13. Data Reduction (archival)#

This tutorial will guide you through the process of finding, downloading, and reducing data from the Mikulski Archive for Space Telescopes (MAST) to create science-ready light curves. We’ll primarily use the lightkurve Python package, a powerful tool for working with space telescope data like TESS, Kepler, and K2.

Goal: By the end of this tutorial, you’ll be able to search for space telescope data, download it, perform basic reduction steps (like cleaning and normalization), and generate a light curve ready for analysis (e.g., searching for exoplanet transits or stellar variability).

Environment: This notebook is designed to work well in online Python IDEs like Google Colab or MS Visual Studio Code with a Python kernel.

13.1. Introduction to MAST and Lightkurve#

13.1.1. What is MAST?#

The Mikulski Archive for Space Telescopes (MAST) is NASA’s primary archive for optical and ultraviolet astronomical data, with a strong focus on exoplanet-finding missions. It hosts data from missions like:

TESS (Transiting Exoplanet Survey Satellite): Currently surveying most of the sky for nearby transiting exoplanets.
Kepler/K2: Famous for discovering thousands of exoplanets using the transit method.
Hubble Space Telescope (HST), GALEX, and others.

You can explore MAST visually via its portal: https://mast.stsci.edu

13.1.2. What is Lightkurve?#

Lightkurve is an open-source Python package designed to simplify the analysis of time-series astronomical data, especially from NASA’s TESS, Kepler, and K2 missions. It provides tools for searching, downloading, processing, and plotting light curves and target pixel files (TPFs).

Official documentation: https://docs.lightkurve.org

13.2. Setting up Your Python Environment#

To use lightkurve, you need Python and several astronomy/scientific packages, including astropy, numpy, and matplotlib. If you are using Google Colab, install lightkurve inside a notebook cell with:

%pip install lightkurve

If you are working from a terminal or a local Anaconda prompt, use:

pip install lightkurve

Recommended local environment

If import lightkurve fails with an autograd, oktopus, or numpy.linalg error, you are probably using a package combination that is incompatible with NumPy 2.x. A clean conda environment with NumPy pinned below 2 is usually the easiest fix:

conda create -n lightkurve-env -c conda-forge python=3.11 "numpy<2" lightkurve astropy astroquery matplotlib scipy pandas jupyterlab ipykernel
conda activate lightkurve-env
python -m ipykernel install --user --name lightkurve-env --display-name "Python (lightkurve-env)"

Now, let’s import lightkurve and other useful packages.

import lightkurve as lk
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

print(f"Lightkurve version: {lk.__version__}")
print(f"NumPy version: {np.__version__}")

Lightkurve version: 2.5.1
NumPy version: 2.4.4

13.3. Searching for Data on MAST#

lightkurve allows you to search for data programmatically. You can search by target name, mission-specific ID, or coordinates.

For the first tutorial example, we will use WASP-18, a bright star with a short-period transiting planet. This is a better first target than a long-period planet because one TESS sector contains many repeated transits. A short-period target makes the phase-folded light curve much easier to see when we intentionally limit the download to one sector/quarter.

Why not start with TOI 700 d?

TOI 700 d is scientifically interesting, but it is not a good first-sector-only tutorial target. Its orbital period is long enough that one TESS sector may contain too few useful transits for a clear folded light curve. Start with a short-period planet first, then move to longer-period targets once the workflow is familiar.

target_star = "WASP-18"
mission = "TESS"
author = "SPOC"

# Other possible examples after you understand the workflow:
# target_star = "TOI 700"       # longer-period multi-planet system; usually needs more sectors
# target_star = "KIC 8462852"   # Kepler/Boyajian's Star example
# target_star = "TIC 150428135" # TIC identifier for TOI 700

search_result = lk.search_lightcurve(target_star, mission=mission, author=author)
search_result

SearchResult containing 10 data products.

#	mission	year	author	exptime	target_name	distance
				s		arcsec
0	TESS Sector 02	2018	SPOC	120	100100827	0.0
1	TESS Sector 03	2018	SPOC	120	100100827	0.0
2	TESS Sector 29	2020	SPOC	20	100100827	0.0
3	TESS Sector 30	2020	SPOC	20	100100827	0.0
4	TESS Sector 29	2020	SPOC	120	100100827	0.0
5	TESS Sector 30	2020	SPOC	120	100100827	0.0
6	TESS Sector 69	2023	SPOC	20	100100827	0.0
7	TESS Sector 69	2023	SPOC	120	100100827	0.0
8	TESS Sector 96	2025	SPOC	20	100100827	0.0
9	TESS Sector 96	2025	SPOC	120	100100827	0.0

The search_result object is a table listing available light curve products for the target. It includes information such as the mission, sector/quarter/campaign, author or pipeline, exposure time, and observation ID.

For this tutorial, we intentionally keep only the first available sector. This keeps the page faster to build and gives students a smaller dataset to understand before they work with multi-sector light curves.

search_tess = search_result

if len(search_tess) == 0:
    raise RuntimeError(f"No {mission} {author} light curves were found for {target_star}.")

print(f"Found {len(search_tess)} {mission} {author} light curve products for {target_star}.")
print("For this tutorial, we will use only the first available product:")
search_tess[:1]

Found 10 TESS SPOC light curve products for WASP-18.
For this tutorial, we will use only the first available product:

SearchResult containing 1 data products.

#	mission	year	author	exptime	target_name	distance
				s		arcsec
0	TESS Sector 02	2018	SPOC	120	100100827	0.0

You can also search for Target Pixel Files (TPFs), which contain the raw pixel data around your target. This is useful if you want to perform custom photometry.

search_tpf_result = lk.search_targetpixelfile(target_star, mission=mission, author=author)

if len(search_tpf_result) == 0:
    print(f"No {mission} {author} target pixel files were found for {target_star}.")
else:
    print(f"Found {len(search_tpf_result)} target pixel file products. We will use only the first one later.")

search_tpf_result[:1]

Found 10 target pixel file products. We will use only the first one later.

SearchResult containing 1 data products.

#	mission	year	author	exptime	target_name	distance
				s		arcsec
0	TESS Sector 02	2018	SPOC	120	100100827	0.0

13.4. Understanding Data Products and Downloading Data#

MAST provides several types of data products:

Light Curve Files (LCFs): These files contain pre-processed light curves. For TESS and Kepler, common flux types include:
- SAP_FLUX (Simple Aperture Photometry): the summed flux inside a pipeline aperture. It can still contain instrumental systematics.
- PDCSAP_FLUX (Pre-search Data Conditioning SAP Flux): SAP flux corrected for many instrumental systematics. This is usually the best starting point for a first transit or variability search.
Target Pixel Files (TPFs): These files contain the time-series images centered on the target star. They are useful when you want to inspect contamination, apertures, or extract your own light curve.

For speed and clarity, we will download only the first available light curve product. In a real research project, you may later download many sectors or quarters and stitch them together.

def convert_masked_flux_to_nan(lc):
    """Convert masked flux-like columns to ordinary arrays with NaNs.

    Some TESS products can store flux columns as masked quantities. That can cause
    errors later when cleaning, normalizing, or stitching. This helper replaces
    masked values with NaN values, which Lightkurve can remove cleanly.
    """
    lc = lc.copy()

    for colname in ["flux", "flux_err", "sap_flux", "sap_flux_err", "pdcsap_flux", "pdcsap_flux_err"]:
        if colname in lc.colnames:
            quantity = lc[colname]
            unit = getattr(quantity, "unit", None)

            try:
                values = quantity.filled(np.nan).value
            except AttributeError:
                values = quantity.value

            values = np.asarray(values, dtype=float)

            if unit is not None:
                lc[colname] = values * unit
            else:
                lc[colname] = values

    return lc

# Download only the first available light curve product.
# The flux_column argument asks Lightkurve to use PDCSAP flux when available.
lc0 = search_tess[0].download(flux_column="pdcsap_flux", quality_bitmask="default")
lc0 = convert_masked_flux_to_nan(lc0)

# Keep a one-item collection so the rest of the notebook still uses collection-style logic.
lc_collection = lk.LightCurveCollection([lc0])
lc_collection

LightCurveCollection of 1 objects:
    0: <TessLightCurve LABEL="TIC 100100827" SECTOR=2 AUTHOR=SPOC FLUX_ORIGIN=pdcsap_flux>

lc_collection is a LightCurveCollection object. In this tutorial, it contains only one light curve because we intentionally downloaded only the first available sector. Keeping the collection structure makes it easier to extend this notebook later to multiple sectors or quarters.

if len(lc_collection) > 0:
    lc0 = lc_collection[0]
    print(lc0)
    lc0.plot(label="First downloaded light curve")
else:
    print(f"No light curves found for {target_star}.")

       time             flux      ...   pos_corr1      pos_corr2   
                    electron / s  ...      pix            pix      
------------------ -------------- ... -------------- --------------
1115072617192     47745.6875 ... -1.2000147e-01  7.8144908e-02
1128961919705 47782.17578125 ... -1.0294508e-01  6.7404680e-02
1142851224547 47737.03515625 ... -1.0323820e-01  6.3302755e-02
117062983188 47739.65234375 ... -1.0774896e-01  5.0843615e-02
1184519134392   47759.828125 ... -9.3161479e-02  8.3432972e-02
1198408439234  47727.6953125 ... -9.2094876e-02  6.8144821e-02
1212297741743  47743.8203125 ... -9.1556683e-02  6.7030065e-02
1226187044247  47756.1640625 ... -9.4503127e-02  5.9139974e-02
1240076349084 47734.61328125 ... -9.1448054e-02  6.1672445e-02
               ...            ... ...            ...            ...
5049589636317  47716.5546875 ...  1.2310899e-02  6.2099039e-03
5063478540658        47688.5 ...  1.8280700e-02  4.9027326e-03
5077367444994 47734.99609375 ...  1.4953445e-02  7.2186007e-03
5091256349335 47757.33984375 ...  1.8104736e-02  1.0969753e-02
510514525368   47724.890625 ...  1.2774169e-02  5.1160059e-03
5119034158026  47665.3046875 ...  9.1379741e-03  1.9494935e-03
5132923062372 47754.12890625 ...  1.4056482e-02 -2.0336423e-03
5146811966717 47737.07421875 ...  1.8469937e-02  8.3355261e-03
5160700871054   47736.796875 ...  1.8965174e-02  3.1502221e-03
51745897754 47728.08203125 ...  1.6774995e-02  7.3410203e-03
Length = 18299 rows

../_images/ee5fec0929e7a2922dfa86f9fa20c3664e28455e6f0e48d86e96785296b7b726.png

The plot shows time on the x-axis and flux (usually PDCSAP_FLUX by default for lk.LightCurve objects from SPOC) on the y-axis. You can specify which flux column to plot:

if len(lc_collection) > 0:
    lc0.plot(label="PDCSAP flux")
else:
    print("No light curve is available to plot.")

../_images/74ef1448307288f6a494207604e75468ab240bd5edc8bc6fac5af4a8bdb02136.png

13.5. Working with Light Curve Files (LCFs)#

A LightCurve object has several useful attributes and methods.

lc.time: time values, often in BJD or a mission-specific barycentric time system.
lc.flux: flux values.
lc.flux_err: uncertainty estimates on flux values.
lc.quality: quality flags indicating potential issues with data points.

13.5.1. Single-sector workflow#

A common research workflow combines many sectors or quarters. For this tutorial, we do the opposite: we keep one sector so the notebook runs quickly and the plots are easier to interpret. That means we do not need to stitch multiple light curves here.

If you later use multiple sectors, clean and normalize each light curve first, then call stitch(corrector_func=lambda lc: lc) to avoid repeated normalization or masked-array errors.

if len(lc_collection) > 0:
    # Clean and normalize the first sector only.
    stitched_lc = lc_collection[0].remove_nans().remove_outliers(sigma=5).normalize()
    stitched_lc.plot(label="First sector: cleaned and normalized")
else:
    stitched_lc = None
    print("No light curve is available to clean.")

../_images/3866eec903a50685b472194d47137bdddc2d688cf31bc975886978cdec1c0594.png

13.5.2. Data Cleaning with Quality Flags#

Space telescope data often have quality flags indicating issues like cosmic ray hits, spacecraft manoeuvres, or bad pixels. lightkurve attempts to handle standard quality masking when loading data (especially for PDCSAP_FLUX which often has NaNs where corrections failed).

We’ve already used remove_nans() which is crucial for PDCSAP_FLUX. remove_outliers() was also applied above. These are good first steps in cleaning.

if 'stitched_lc' in locals() and stitched_lc is not None:
    cleaned_lc = stitched_lc
    print(f"Points in the cleaned single-sector light curve: {len(cleaned_lc.flux)}")
    cleaned_lc.plot(label="Cleaned single-sector light curve")
else:
    cleaned_lc = None
    print("Cleaned light curve is not available.")

Points in the cleaned single-sector light curve: 18299

../_images/ed927695a2fb898964fd9850887d7ae35b9828c66bbf9cf70f333aff13abeb44.png

13.5.3. Normalization and Detrending#

Normalization typically means dividing the flux by a representative baseline value so the out-of-transit flux is near 1.0. This makes transit depths easier to read as fractional changes in brightness.

Detrending removes slow variations from instrumental systematics or stellar variability. For a first pass, lightkurve.flatten() is a convenient option. It applies a Savitzky-Golay filter to estimate and remove low-frequency trends.

The window_length value is measured in cadences, not days. It should be long compared with the transit duration so the filter does not remove the transit itself.

if 'cleaned_lc' in locals() and cleaned_lc is not None:
    try:
        # This is a simple first-pass detrending choice for a short-period hot Jupiter target.
        # For a science analysis, test several window lengths and verify that the transit depth is not distorted.
        flat_lc = cleaned_lc.flatten(window_length=401, polyorder=2)
        flat_lc.plot(label="Flattened single-sector light curve")
    except Exception as err:
        print(f"Flattening failed: {err}")
        print("Using the cleaned but unflattened light curve for the remaining examples.")
        flat_lc = cleaned_lc
        flat_lc.plot(label="Cleaned light curve")
else:
    flat_lc = None
    print("Cleaned light curve is not available for detrending.")

../_images/1b0c2440ec4fca8205fe3f8a779900084399a2072d4aa52585721059193b02a2.png

13.6. Working with Target Pixel Files (TPFs) - Custom Photometry#

If pre-processed light curves are not suitable, you can inspect the target pixel file and create your own light curve. This can matter if the target is blended with nearby stars, if the pipeline aperture is poor, or if you need to understand where the measured flux is coming from.

For this tutorial, we again use only the first available TPF so the notebook remains quick to run.

if len(search_tpf_result) > 0:
    print("Downloading the first available TPF only.")
    tpf = search_tpf_result[0].download(quality_bitmask="default")
    tpf.plot(frame=0)
else:
    tpf = None
    print(f"No TPFs found for {target_star} from {mission} {author}.")

Downloading the first available TPF only.

../_images/601ff4c63f3ab54100bf9c6c0494b50d91120ab8b62ad35570bda23b51442ebb.png

The plot shows the pixel data. The red outline is the default pipeline aperture. You can define your own aperture interactively (in a Jupyter Notebook session by enabling tpf.interact()) or by creating a boolean mask programmatically.

13.6.1. Creating a Custom Aperture Mask#

Let’s try to define a simple circular aperture or use a threshold mask. For this example, we’ll use the pipeline’s default aperture, which is often a good choice.

if 'tpf' in locals() and tpf is not None:
    # Using the pipeline-defined aperture mask
    pipeline_aperture = tpf.pipeline_mask
    tpf.plot(aperture_mask=pipeline_aperture, mask_color='red');
    print(f"Pipeline aperture mask sum (pixels): {pipeline_aperture.sum()}")

    # To create a custom threshold mask (alternative):
    # custom_aperture = tpf.create_threshold_mask(threshold=5) # threshold in sigma above median
    # tpf.plot(aperture_mask=custom_aperture, mask_color='blue');
    # print(f"Custom threshold aperture mask sum (pixels): {custom_aperture.sum()}")
else:
    print("TPF not available for aperture definition.")

Pipeline aperture mask sum (pixels): 28

../_images/86383f9be565f84f5e2717495bd24c4c4d00d6c1918205bde168cdde6fecdb8d.png

13.6.2. Performing Aperture Photometry#

Once you have an aperture mask, you can extract the light curve using the to_lightcurve() method. This method will sum the flux within the aperture for each cadence.

if 'tpf' in locals() and tpf is not None and 'pipeline_aperture' in locals() and pipeline_aperture.sum() > 0:
    custom_lc_from_tpf = tpf.to_lightcurve(aperture_mask=pipeline_aperture)
    custom_lc_from_tpf.plot(label="Light Curve from TPF (Pipeline Aperture)");
elif 'tpf' in locals() and hasattr(tpf, 'pipeline_mask') and tpf.pipeline_mask.sum() > 0: 
    print("Using pipeline mask as custom_mask was not effective or defined.")
    custom_lc_from_tpf = tpf.to_lightcurve(aperture_mask=tpf.pipeline_mask) # Fallback to pipeline mask
    custom_lc_from_tpf.plot(label="Pipeline Aperture Light Curve (from TPF)");
else:
    print("TPF or suitable aperture mask not available for custom photometry.")

../_images/4033e6be43c69536a9b21eb84499e069ff6fb97fac05ef9b680b8f2910448651.png

This custom light curve is essentially a SAP (Simple Aperture Photometry) light curve. It will likely contain systematics from the instrument and spacecraft. You would then need to perform systematic correction (e.g., using Regressors in lightkurve, or methods like CBV, PLD if you delve deeper). For simplicity, we’ll skip advanced custom systematic correction here and revert to using the PDCSAP flux from earlier (flat_lc) for the transit search example, as it has already undergone some systematic correction.

13.7. Example: Finding and Folding a Transit#

Now we will use the flattened light curve to find a repeating transit signal. Because the target is a short-period transiting planet system, one TESS sector should contain many transits. This makes it a better first tutorial example than a long-period planet.

Instead of hard-coding the period, we will use a Box Least Squares periodogram. BLS is designed to find box-shaped dips in a light curve, which makes it useful for transiting exoplanets.

if 'flat_lc' in locals() and flat_lc is not None and len(flat_lc.remove_nans()) > 0:
    search_lc = flat_lc.remove_nans()

    # Search for short-period transits. WASP-18 b is a short-period planet, so this range is intentionally narrow.
    period_grid = np.linspace(0.5, 2.0, 5000)
    duration_grid = np.linspace(0.02, 0.20, 20)

    bls = search_lc.to_periodogram(method="bls", period=period_grid, duration=duration_grid)

    best_period = bls.period_at_max_power
    best_t0 = bls.transit_time_at_max_power
    best_duration = bls.duration_at_max_power

    best_period_days = best_period.value
    best_t0_value = best_t0.value
    best_duration_days = best_duration.value

    print(f"The strongest BLS period is {best_period_days:.5f} d.")
    print(f"The estimated transit midpoint is {best_t0_value:.5f} {best_t0.format}.")
    print(f"The estimated transit duration is {best_duration_days:.5f} d.")

    fig, ax = plt.subplots(figsize=(7, 4), dpi=120)

    bls.plot(ax=ax)

    ax.set_xlabel("Period [d]")
    ax.set_ylabel("BLS power")
    ax.set_title(f"BLS periodogram for {safe_target_name}")

    plt.show()
else:
    folded_lc = None
    print("Flattened light curve is not available or is empty for BLS and folding.")

The strongest BLS period is 0.94139 d.
The estimated transit midpoint is 1354.45951 btjd.
The estimated transit duration is 0.07600 d.

../_images/4620f8834f04ff9e9a4673527263e33098432847b0942bde088f669d020e7549.png

The folded plot should show a repeated transit dip near phase zero. The unbinned points show the scatter in the data, while the binned curve shows the average transit shape more clearly.

First-sector limitation

A single sector is good for a tutorial because it runs quickly and is easier to inspect. It is not always enough for a final research result. Longer-period planets, weak transits, or systems with strong stellar variability often require multiple sectors or quarters.

13.8. Saving Your Reduced Light Curve#

Once you have a processed light curve that you’re happy with (e.g., flat_lc or folded_lc), you can save it to a file. Common formats include CSV or FITS.

Note for Google Colab users: Files saved this way will be stored in the temporary Colab environment. To save them permanently, you’ll need to download them to your local machine or mount your Google Drive and save them there. You can find downloaded files in the file browser panel in Colab (usually on the left sidebar) and right-click to download.

if 'flat_lc' in locals() and flat_lc is not None and len(flat_lc.remove_nans()) > 0:
    safe_target_name = target_star.replace(" ", "_")

    output_filename_csv = f"{safe_target_name}_single_sector_flat_lightcurve.csv"
    flat_lc.to_csv(output_filename_csv, overwrite=True)
    print(f"Flattened light curve saved as: {output_filename_csv}")

    output_filename_fits = f"{safe_target_name}_single_sector_flat_lightcurve.fits"
    flat_lc.write(output_filename_fits, overwrite=True)
    print(f"Flattened light curve saved as: {output_filename_fits}")

    if 'folded_lc' in locals() and folded_lc is not None:
        output_filename_folded_csv = f"{safe_target_name}_single_sector_folded_lightcurve.csv"
        folded_lc.to_csv(output_filename_folded_csv, overwrite=True)
        print(f"Folded light curve saved as: {output_filename_folded_csv}")
else:
    print("No processed light curve is available to save.")

Flattened light curve saved as: WASP-18_single_sector_flat_lightcurve.csv
Flattened light curve saved as: WASP-18_single_sector_flat_lightcurve.fits

WARNING: The unit 'electron / s' could not be saved in native FITS format and hence will be lost to non-astropy fits readers. Within astropy, the unit can roundtrip using QTable, though one has to enable the unit before reading. [astropy.io.fits.convenience]
WARNING: Meta-data keyword BITPIX will be ignored since it conflicts with a FITS reserved keyword [astropy.io.fits.convenience]
WARNING: Meta-data keyword NAXIS will be ignored since it conflicts with a FITS reserved keyword [astropy.io.fits.convenience]
WARNING: VerifyWarning: Keyword name 'FLUX_ORIGIN' is greater than 8 characters or contains characters not allowed by the FITS standard; a HIERARCH card will be created. [astropy.io.fits.card]
WARNING: VerifyWarning: Keyword name 'QUALITY_BITMASK' is greater than 8 characters or contains characters not allowed by the FITS standard; a HIERARCH card will be created. [astropy.io.fits.card]
WARNING: VerifyWarning: Keyword name 'QUALITY_MASK' is greater than 8 characters or contains characters not allowed by the FITS standard; a HIERARCH card will be created. [astropy.io.fits.card]
WARNING: Attribute `QUALITY_MASK` of type <class 'numpy.ndarray'> cannot be added to FITS Header - skipping [astropy.io.fits.convenience]
WARNING: VerifyWarning: Keyword name 'NORMALIZED' is greater than 8 characters or contains characters not allowed by the FITS standard; a HIERARCH card will be created. [astropy.io.fits.card]

Folded light curve saved as: WASP-18_single_sector_folded_lightcurve.csv

13.9. Conclusion and Further Steps#

Congratulations! You’ve learned the basics of:

Searching for data from space telescopes like TESS and Kepler using lightkurve.
Limiting a tutorial workflow to the first sector/quarter so the analysis runs quickly and remains easy to inspect.
Understanding the difference between Light Curve Files (LCFs) and Target Pixel Files (TPFs).
Downloading and plotting light curves.
Performing essential data reduction steps: cleaning, normalization, and detrending.
Extracting a custom light curve from TPF data (basic aperture photometry).
Phase-folding a light curve to look for periodic signals like exoplanet transits.
Saving your processed light curve, with considerations for online IDEs.

Further Steps for Your Research:

Advanced Detrending: Explore more sophisticated detrending techniques (e.g., Gaussian Processes, Cotrending Basis Vectors - CBVs, Pixel Level Decorrelation - PLD) if simple methods are insufficient.
Period Finding: If the period of a signal is unknown, use period-finding algorithms (e.g., Box Least Squares for transits, Lomb-Scargle periodogram for stellar oscillations) available in lightkurve or astropy.timeseries.
Transit Modeling: If you find a transit, you can model its shape to determine planetary parameters (radius, inclination, etc.) using packages like exoplanet, batman, or PyTransit.
Systematics Mitigation: Deep dive into understanding and mitigating instrumental systematics specific to the mission you are working with.
Consult Documentation: The Lightkurve documentation is an excellent resource with many tutorials and API references.