Data screening
- Data screening should be conducted prior to data analysis, to help ensure the integrity of the data.
- Data screening means checking data for errors and fixing or removing these errors. Try to find out as much as possible about problematic data, then make a decision which maximises "signal" and minimises "noise".
- Keep a record of data screening steps undertaken and any changes made to the data. This should be summarised in the Results.
- Check through the data file (case by case and variable by variable), looking for and addressing any oddities, such as:
- Out-of-range values:
- Check for out-of-range values, e.g., by obtaining descriptive statistics (in SPSS, use Analyze - Descriptive Statistics - Descriptives) and examining the minimum and maximum values for all variables of interest.
- If out-of-range values are identified, decide whether to accept, replace, or remove these values.
- In the SPSS Data View, cases can be sorted by a variable with out-of-range values so that the affected cases appear near the top and/or bottom of the data file. Alternatively, a search and find could be used to locate the cells which contain the out-of-range values.
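For readers working outside SPSS, the same range check can be sketched in Python. This is a minimal illustration using only the standard library; the variable names (`q1`, `q2`), the 1 to 5 valid range, and the sample data are assumptions made for the example.

```python
# Hypothetical survey data: each dict is one case (row).
cases = [
    {"id": 1, "q1": 4, "q2": 2},
    {"id": 2, "q1": 77, "q2": 5},   # 77 is outside the assumed 1-5 scale
    {"id": 3, "q1": 3, "q2": 0},    # 0 is outside the assumed 1-5 scale
]

# Assumed valid range for each variable of interest.
valid_ranges = {"q1": (1, 5), "q2": (1, 5)}


def min_max(cases, variable):
    """Return the observed minimum and maximum of one variable, ignoring missing values."""
    values = [c[variable] for c in cases if c[variable] is not None]
    return min(values), max(values)


def out_of_range(cases, valid_ranges):
    """List (case id, variable, value) for every value outside its valid range."""
    problems = []
    for case in cases:
        for variable, (low, high) in valid_ranges.items():
            value = case[variable]
            if value is not None and not (low <= value <= high):
                problems.append((case["id"], variable, value))
    return problems


# Minimum and maximum per variable, as in a descriptives table.
for variable in valid_ranges:
    print(variable, min_max(cases, variable))

# Cases and cells that need a decision: accept, replace, or remove.
flagged = out_of_range(cases, valid_ranges)
print(flagged)
```

The flagged list only identifies the cells in question; what to do with each value remains a judgement for the analyst.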
- Duplicate cases:
- Duplicate cases occur when two or more cases have identical or near-identical data.
- Check for duplicate cases either manually or in SPSS via Data - Identify Duplicate Cases, entering the variables on which cases should be matched.
- If duplicate cases are identified, consider whether to remove all copies (e.g., if the integrity of the data is in doubt because it may have been fabricated and then duplicated) or to retain one copy of each case and delete the rest.
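A duplicate check of this kind can also be sketched in Python. The example below is only an illustration with the standard library; the matching variables (`age`, `q1`, `q2`) and the data are hypothetical, and it detects exact matches only.

```python
from collections import defaultdict

# Hypothetical data: each dict is one case (row).
cases = [
    {"id": 1, "age": 21, "q1": 4, "q2": 2},
    {"id": 2, "age": 34, "q1": 5, "q2": 5},
    {"id": 3, "age": 21, "q1": 4, "q2": 2},  # same answers as case 1
]

# Assumed matching variables; the id is excluded because it differs by design.
match_on = ["age", "q1", "q2"]


def find_duplicates(cases, match_on):
    """Group the ids of cases that share identical values on all matching variables."""
    groups = defaultdict(list)
    for case in cases:
        key = tuple(case[v] for v in match_on)
        groups[key].append(case["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


def keep_first(cases, match_on):
    """Retain the first copy of each case and drop later exact duplicates."""
    seen = set()
    kept = []
    for case in cases:
        key = tuple(case[v] for v in match_on)
        if key not in seen:
            seen.add(key)
            kept.append(case)
    return kept


duplicate_groups = find_duplicates(cases, match_on)
deduplicated = keep_first(cases, match_on)
print(duplicate_groups)
print([c["id"] for c in deduplicated])
```

Near-identical cases are not caught by an exact match, so they still need manual inspection or a narrower set of matching variables.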
- Empty cases: cases with little or no data could be removed.
- Cases with responses which lack meaningful variation (e.g., 5 5 5 5 5 5 5 5 5) or which exhibit obviously arbitrary patterns (e.g., zig-zag: 1 2 3 4 5 4 3 2 1 2 3 4 5): such responses are unlikely to be valid and should probably be deleted.
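Empty cases and responses without variation can be flagged automatically before a manual review. The following is a rough sketch in Python; the data and the cut-off of more than half of the items missing are assumptions chosen for illustration, not recommended thresholds.

```python
# Hypothetical item responses per case id; None marks a missing answer.
responses = {
    1: [4, 2, 5, 3, 4, 2],
    2: [5, 5, 5, 5, 5, 5],                  # no variation
    3: [None, None, None, 2, None, None],   # mostly empty
    4: [1, 2, 3, 4, 5, 4],
}

# Assumed cut-off: flag cases with more than half of the items missing.
MAX_MISSING_PROPORTION = 0.5


def is_mostly_empty(answers, max_missing=MAX_MISSING_PROPORTION):
    """True when the proportion of missing answers exceeds the cut-off."""
    missing = sum(1 for a in answers if a is None)
    return missing / len(answers) > max_missing


def lacks_variation(answers):
    """True when there are at least two answers and all of them are the same value."""
    present = [a for a in answers if a is not None]
    return len(present) > 1 and len(set(present)) == 1


empty_ids = [i for i, a in responses.items() if is_mostly_empty(a)]
flat_ids = [i for i, a in responses.items() if lacks_variation(a)]
print(empty_ids, flat_ids)
```

Arbitrary patterns such as a zig-zag are harder to detect with a simple rule, so flagged and unflagged cases alike are worth a visual check before anything is deleted.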
- Correcting out-of-range and other erroneous values:
- To change data in a cell, left-click on the cell. Delete the data to make it system-missing (sysmis), or type in a new value.
- For cases with a lot of erroneous data, it is probably best to remove the entire case - i.e., delete the whole row.
- For cases with some erroneous data, it is probably best to make the erroneous data missing, unless the correct value is obvious (e.g., if 77 was entered, it might reasonably be deduced that 7 was intended), in which case the incorrect value can be replaced with a best-guess correct value.
- Consider whether reverse coding and/or recoding of data might be necessary or appropriate.
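Making an erroneous value missing and reverse coding an item can both be expressed as small functions. The sketch below assumes a 1 to 5 response scale and uses hypothetical item names; `None` stands in for a missing value.

```python
# Assumed 1-5 response scale; the item names are hypothetical.
SCALE_MIN, SCALE_MAX = 1, 5

case = {"id": 2, "q1": 77, "q2": 5, "q3_negative": 2}


def clean_value(value, low=SCALE_MIN, high=SCALE_MAX):
    """Return the value if it is within range, otherwise None (missing)."""
    if value is None or not (low <= value <= high):
        return None
    return value


def reverse_code(value, low=SCALE_MIN, high=SCALE_MAX):
    """Reverse a score on a low-to-high scale; missing values stay missing."""
    if value is None:
        return None
    return low + high - value


case["q1"] = clean_value(case["q1"])                     # 77 becomes missing
case["q3_negative"] = reverse_code(case["q3_negative"])  # 2 becomes 4
print(case)
```

Replacing 77 with a best guess of 7 would only make sense on a scale where 7 is a valid value; on the 1 to 5 scale assumed here it is set to missing instead.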
- Where hard-copy surveys have been entered into electronic format:
- Check the data entry accuracy (e.g., double-check the electronic data against the hard-copy surveys).
- Check for missing data (e.g., using descriptives or missing data analysis) and fix it if possible (e.g., by checking against the hard-copy survey). Consider possible replacement of missing data (e.g., mean replacement or regression-based replacement).
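Counting missing values and applying mean replacement can be sketched in a few lines of Python. The scores below are hypothetical, and this shows only the simplest form of mean replacement; regression-based replacement is not covered.

```python
from statistics import mean

# Hypothetical scores for one variable; None marks missing data.
scores = [4, None, 3, 5, None, 4]


def count_missing(values):
    """Number of missing entries in one variable."""
    return sum(1 for v in values if v is None)


def mean_replace(values):
    """Replace missing values with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to base a replacement on
    fill = mean(observed)
    return [fill if v is None else v for v in values]


n_missing = count_missing(scores)
filled = mean_replace(scores)
print(n_missing, filled)
```

Mean replacement leaves the variable's mean unchanged but reduces its variance, which is one reason to record how many values were replaced.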
External links
- Data screening (statwiki.kolobcreations.com)
This article is issued from Wikiversity - version of the Thursday, April 09, 2015. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.